Abstract

This work provides a computationally efficient and statistically consistent moment-based
estimator for mixtures of spherical Gaussians. Under the condition that component means are
in general position, a simple spectral decomposition technique yields consistent parameter
estimates from low-order observable moments, without additional minimum separation assumptions
needed by previous computationally efficient estimation procedures. Thus computational and
information-theoretic barriers to efficient estimation in mixture models are precluded when the
mixture components have means in general position and spherical covariances. Some connections
are made to estimation problems related to independent component analysis.

1 Introduction

The Gaussian mixture model (Pearson, 1894; Titterington et al., 1985) is one of the most well
studied and widely-used models in applied statistics and machine learning. An important special
case of this model (the primary focus of this work) restricts the Gaussian components to have
spherical covariance matrices; this probabilistic model is closely related to the (non-probabilistic)
k-means clustering problem (MacQueen, 1967).



The mixture of spherical Gaussians model is specified as follows. Let $w_i$ be the probability of
choosing component $i \in [k] := \{1, 2, \dots, k\}$, let $\mu_1, \mu_2, \dots, \mu_k \in \mathbb{R}^d$ be the component mean vectors,
and let $\sigma_1^2, \sigma_2^2, \dots, \sigma_k^2 \ge 0$ be the component variances. Define

$$w := [w_1, w_2, \dots, w_k]^\top \in \mathbb{R}^k, \qquad A := [\mu_1 | \mu_2 | \cdots | \mu_k] \in \mathbb{R}^{d \times k},$$

so $w$ is a probability vector, and $A$ is the matrix whose columns are the component means. Let
$x \in \mathbb{R}^d$ be the (observed) random vector given by

$$x := \mu_h + z,$$

where $h$ is the discrete random variable with $\Pr(h = i) = w_i$ for $i \in [k]$, and $z$ is a random vector
whose conditional distribution given $h = i$ (for some $i \in [k]$) is the multivariate Gaussian $\mathcal{N}(0, \sigma_i^2 I)$
with mean zero and covariance $\sigma_i^2 I$.

The estimation task is to accurately recover the model parameters (component means, variances,
and mixing weights) $\{(\mu_i, \sigma_i^2, w_i) : i \in [k]\}$ from independent copies of $x$.
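
To make the generative model concrete, the following is a minimal sampling sketch in Python with NumPy. The function name `sample_spherical_gmm` and the particular values of `A`, `w`, and `sigma2` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sample_spherical_gmm(A, w, sigma2, n, rng):
    """Draw n samples x = mu_h + z with Pr(h = i) = w[i] and z | h = i ~ N(0, sigma2[i] * I)."""
    d, k = A.shape
    h = rng.choice(k, size=n, p=w)              # latent component labels
    noise = rng.standard_normal((n, d))         # unit-variance spherical noise
    x = A[:, h].T + np.sqrt(sigma2[h])[:, None] * noise
    return x, h

# Hypothetical parameters: the columns of A are the means mu_1, mu_2, mu_3 in R^4.
rng = np.random.default_rng(0)
A = np.array([[4.0, 0.0, 0.0],
              [0.0, 4.0, 0.0],
              [0.0, 0.0, 4.0],
              [1.0, 1.0, 1.0]])
w = np.array([0.5, 0.3, 0.2])
sigma2 = np.array([1.0, 0.5, 0.25])
x, h = sample_spherical_gmm(A, w, sigma2, 50_000, rng)
```

Each row of `x` is one independent copy of the observed vector; the labels `h` are returned only for inspection and would be hidden from an estimator.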

This work gives a procedure for efficiently and exactly recovering the parameters using a simple 
spectral decomposition of low-order moments of x, under the following condition. 

Condition 1 (Non-degeneracy). The component means span a $k$-dimensional subspace (i.e., the
matrix $A$ has column rank $k$), and the vector $w$ has strictly positive entries.



The proposed estimator is based on a spectral decomposition technique (Chang, 1996; Mossel and Roch,
2006; Anandkumar et al., 2012b), and is easily stated in terms of exact population moments of the
observed $x$. With finite samples, one can use a plug-in estimator based on empirical moments
of $x$ in place of exact moments. These empirical moments converge to the exact moments at
a rate of $O(n^{-1/2})$, where $n$ is the sample size. As discussed in Section 3, sample complexity
bounds for accurate parameter estimation can be derived using matrix perturbation arguments
(Anandkumar et al., 2012b). Since only low-order moments are required by the plug-in estimator,
the sample complexity is polynomial in the relevant parameters of the estimation problem.



Related work. The first estimators for the Gaussian mixture models were based on the
method-of-moments, as introduced by Pearson (1894) (see also Lindsay and Basak, 1993, and the
references therein). Roughly speaking, these estimators are based on finding parameters under which
the Gaussian mixture distribution has moments approximately matching the observed empirical
moments. Finding these parameters typically involves solving systems of multivariate polynomial
equations, which is computationally challenging. Besides this, the order of the moments
used by some of the early moment-based estimators grew with either the dimension $d$ or the
number of components $k$, which is undesirable because the empirical estimates of such high-order
moments may only be reliable when the sample size is exponential in $d$ or $k$. Both the
computational and sample complexity issues have been addressed in recent years, at least under various
restrictions. For instance, several distance-based estimators require that the component means
be well-separated in Euclidean space, by at least some large factor times the directional
standard deviation of the individual component distributions (Dasgupta, 1999; Arora and Kannan,
2001; Dasgupta and Schulman, 2007; Vempala and Wang, 2002; Chaudhuri and Rao, 2008), but
otherwise have polynomial computational and sample complexity. Some recent moment-based
estimators avoid the minimum separation condition of distance-based estimators by requiring either
computational or data resources exponential in the number of mixing components $k$ (but not the
dimension $d$) (Belkin and Sinha, 2010; Kalai et al., 2010; Moitra and Valiant, 2010) or by making
a non-degenerate multi-view assumption (Anandkumar et al., 2012b).



By contrast, the moment-based estimator described in this work does not require a minimum 
separation condition, exponential computational or data resources, or non-degenerate multiple 
views. Instead, it relies only on the non-degeneracy condition discussed above together with a 
spherical noise condition. The non-degeneracy condition is much weaker than an explicit minimum 
separation condition because the parameters can be arbitrarily close to being degenerate, as long 
as the sample size grows polynomially with a natural quantity measuring this closeness to degen- 
eracy (akin to a condition number). Like other moment-based estimators, the proposed estimator 
is based on solving multivariate polynomial equations, although these solutions can be found effi- 
ciently because the problems are cast as eigenvalue decompositions of symmetric matrices, which 
are efficient to compute. 

Recent work of Moitra and Valiant (2010) demonstrates an information-theoretic barrier to
estimation for general Gaussian mixture models. More precisely, they construct a pair of
one-dimensional mixtures of Gaussians (with separated component means) such that the statistical
distance between the two mixture distributions is exponentially small in the number of components.
This implies that in the worst case, the sample size required to obtain accurate parameter estimates
must grow exponentially with the number of components, even when the component distributions
are non-negligibly separated. A consequence of the present work is that natural non-degeneracy
conditions preclude these worst case scenarios. The non-degeneracy condition in this work is similar
to one used for bypassing computational (cryptographic) barriers to estimation for hidden Markov
models (Chang, 1996; Mossel and Roch, 2006; Hsu et al., 2012a; Anandkumar et al., 2012b).

Finally, it is interesting to note that similar algebraic techniques have been developed for
certain models in independent component analysis (ICA) (Comon, 1994; Cardoso and Comon, 1996;
Hyvarinen and Oja, 2000; Comon and Jutten, 2010; Arora et al., 2012) and other closely related
problems (Frieze et al., 1996; Nguyen and Regev, 2009). In contrast to the ICA setting, handling
non-spherical Gaussian noise for mixture models appears to be a more delicate issue. These
connections and open problems are further discussed in Section 3.



2 Moment-based estimation 

This section describes a method-of-moments estimator for the spherical Gaussian mixture model. 

The following theorem is the main structural result that relates the model parameters to ob- 
servable moments. 

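Theorem 1 (Observable moment structure). Assume Condition 1 holds. The average variance
$\bar\sigma^2 := \sum_{i=1}^k w_i \sigma_i^2$ is the smallest eigenvalue of the covariance matrix $E[(x - E[x])(x - E[x])^\top]$. Let
$v \in \mathbb{R}^d$ be any unit norm eigenvector corresponding to the eigenvalue $\bar\sigma^2$. Define

$$M_1 := E[x (v^\top (x - E[x]))^2] \in \mathbb{R}^d,$$

$$M_2 := E[x \otimes x] - \bar\sigma^2 I \in \mathbb{R}^{d \times d},$$

$$M_3 := E[x \otimes x \otimes x] - \sum_{i=1}^d \left( M_1 \otimes e_i \otimes e_i + e_i \otimes M_1 \otimes e_i + e_i \otimes e_i \otimes M_1 \right) \in \mathbb{R}^{d \times d \times d}$$

(where $\otimes$ denotes tensor product, and $\{e_1, e_2, \dots, e_d\}$ is the coordinate basis for $\mathbb{R}^d$). Then

$$M_1 = \sum_{i=1}^k w_i \sigma_i^2 \mu_i, \qquad M_2 = \sum_{i=1}^k w_i \, \mu_i \otimes \mu_i, \qquad M_3 = \sum_{i=1}^k w_i \, \mu_i \otimes \mu_i \otimes \mu_i.$$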

Remark 1. We note that in the special case where $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2 = \sigma^2$ (i.e., the mixture
components share a common spherical covariance matrix), the average variance $\bar\sigma^2$ is simply $\sigma^2$,
and $M_3$ has a simpler form:

$$M_3 = E[x \otimes x \otimes x] - \sigma^2 \sum_{i=1}^d \left( E[x] \otimes e_i \otimes e_i + e_i \otimes E[x] \otimes e_i + e_i \otimes e_i \otimes E[x] \right) \in \mathbb{R}^{d \times d \times d}.$$

There is no need to refer to the eigenvectors of the covariance matrix or $M_1$.

Proof of Theorem 1. We first characterize the smallest eigenvalue of the covariance matrix of $x$,
as well as all corresponding eigenvectors $v$. Let $\bar\mu := E[x] = E[\mu_h] = \sum_{i=1}^k w_i \mu_i$. The covariance
matrix of $x$ is

$$E[(x - \bar\mu) \otimes (x - \bar\mu)] = \sum_{i=1}^k w_i \left( (\mu_i - \bar\mu) \otimes (\mu_i - \bar\mu) + \sigma_i^2 I \right) = \sum_{i=1}^k w_i (\mu_i - \bar\mu) \otimes (\mu_i - \bar\mu) + \bar\sigma^2 I.$$

Since the vectors $\mu_i - \bar\mu$ for $i \in [k]$ are linearly dependent ($\sum_{i=1}^k w_i (\mu_i - \bar\mu) = 0$), the positive
semidefinite matrix $\sum_{i=1}^k w_i (\mu_i - \bar\mu) \otimes (\mu_i - \bar\mu)$ has rank $r \le k - 1$. Thus, the $d - r$ smallest eigenvalues
are exactly $\bar\sigma^2$, while all other eigenvalues are strictly larger than $\bar\sigma^2$. The strict separation of
eigenvalues implies that every eigenvector corresponding to $\bar\sigma^2$ is in the null space of
$\sum_{i=1}^k w_i (\mu_i - \bar\mu) \otimes (\mu_i - \bar\mu)$; thus $v^\top (\mu_i - \bar\mu) = 0$ for all $i \in [k]$.

Now we can express $M_1$, $M_2$, and $M_3$ in terms of the parameters $w_i$, $\mu_i$, and $\sigma_i^2$. First,

$$M_1 = E[x (v^\top (x - E[x]))^2] = E[(\mu_h + z)(v^\top (\mu_h - \bar\mu + z))^2] = E[(\mu_h + z)(v^\top z)^2] = E[\mu_h \sigma_h^2],$$

where the last step uses the fact that $z | h \sim \mathcal{N}(0, \sigma_h^2 I)$, which implies that conditioned on $h$,
$E[(v^\top z)^2 | h] = \sigma_h^2$ and $E[z (v^\top z)^2 | h] = 0$. Next, observe that $E[z \otimes z] = \sum_{i=1}^k w_i \sigma_i^2 I = \bar\sigma^2 I$, so

$$M_2 = E[x \otimes x] - \bar\sigma^2 I = E[\mu_h \otimes \mu_h] + E[z \otimes z] - \bar\sigma^2 I = E[\mu_h \otimes \mu_h] = \sum_{i=1}^k w_i \, \mu_i \otimes \mu_i.$$

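Finally, for $M_3$, we first observe that

$$E[x \otimes x \otimes x] = E[\mu_h \otimes \mu_h \otimes \mu_h] + E[\mu_h \otimes z \otimes z] + E[z \otimes \mu_h \otimes z] + E[z \otimes z \otimes \mu_h]$$

(terms such as $E[\mu_h \otimes \mu_h \otimes z]$ and $E[z \otimes z \otimes z]$ vanish because $z | h \sim \mathcal{N}(0, \sigma_h^2 I)$). We now claim
that $E[\mu_h \otimes z \otimes z] = \sum_{i=1}^d M_1 \otimes e_i \otimes e_i$. This holds because

$$E[\mu_h \otimes z \otimes z] = E\big[ E[\mu_h \otimes z \otimes z | h] \big] = E\Big[ E\Big[ \sum_{i,j=1}^d z_i z_j \, \mu_h \otimes e_i \otimes e_j \,\Big|\, h \Big] \Big] = E\Big[ \sum_{i=1}^d \sigma_h^2 \, \mu_h \otimes e_i \otimes e_i \Big] = \sum_{i=1}^d M_1 \otimes e_i \otimes e_i,$$

crucially using the fact that $E[z_i z_j | h] = 0$ for $i \ne j$ and $E[z_i^2 | h] = \sigma_h^2$. By the same derivation, we
have $E[z \otimes \mu_h \otimes z] = \sum_{i=1}^d e_i \otimes M_1 \otimes e_i$ and $E[z \otimes z \otimes \mu_h] = \sum_{i=1}^d e_i \otimes e_i \otimes M_1$. Therefore,

$$M_3 = E[x \otimes x \otimes x] - \left( E[\mu_h \otimes z \otimes z] + E[z \otimes \mu_h \otimes z] + E[z \otimes z \otimes \mu_h] \right) = E[\mu_h \otimes \mu_h \otimes \mu_h] = \sum_{i=1}^k w_i \, \mu_i \otimes \mu_i \otimes \mu_i$$

as claimed.

□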
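
The first step of the proof can be checked numerically. The sketch below builds the exact covariance matrix from a hypothetical set of parameters (the values of `A`, `w`, and `sigma2` are assumptions chosen for illustration) and confirms that its smallest eigenvalue is the average variance, with an eigenvector orthogonal to every centered mean.

```python
import numpy as np

# Hypothetical parameters: columns of A are the means mu_1, mu_2, mu_3 in R^4.
A = np.array([[4.0, 0.0, 0.0],
              [0.0, 4.0, 0.0],
              [0.0, 0.0, 4.0],
              [1.0, 1.0, 1.0]])
w = np.array([0.5, 0.3, 0.2])
sigma2 = np.array([1.0, 0.5, 0.25])

mu_bar = A @ w                               # E[x]
centered = A - mu_bar[:, None]               # columns are mu_i - mu_bar
avg_var = w @ sigma2                         # average variance sum_i w_i sigma_i^2

# Exact covariance: sum_i w_i (mu_i - mu_bar)(mu_i - mu_bar)^T + avg_var * I.
cov = (centered * w) @ centered.T + avg_var * np.eye(A.shape[0])

evals, evecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
v = evecs[:, 0]                              # unit eigenvector for the smallest eigenvalue

M1 = A @ (w * sigma2)                        # M_1 = sum_i w_i sigma_i^2 mu_i
```

Here the centered means span a two-dimensional subspace, so the eigenvalue `avg_var` has multiplicity `d - r = 2`, and any vector in that eigenspace can play the role of `v`.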



Theorem 1 shows the relationship between (some functions of) the observable moments and the
desired parameters. A simple estimator based on this moment structure is given in the following
theorem. For a third-order tensor $T \in \mathbb{R}^{d \times d \times d}$, we define

$$T(\eta) := \sum_{i_1=1}^d \sum_{i_2=1}^d \sum_{i_3=1}^d T_{i_1,i_2,i_3} \, \eta_{i_3} \, e_{i_1} \otimes e_{i_2} \in \mathbb{R}^{d \times d}$$

for any vector $\eta \in \mathbb{R}^d$.

Theorem 2 (Moment-based estimator). The following can be added to the results of Theorem 1.
Suppose $\eta^\top \mu_1, \eta^\top \mu_2, \dots, \eta^\top \mu_k$ are distinct and non-zero (which is satisfied almost surely, for
instance, if $\eta$ is chosen uniformly at random from the unit sphere in $\mathbb{R}^d$). Then the matrix

$$M_{\mathrm{GMM}}(\eta) := M_2^{\dagger 1/2} M_3(\eta) M_2^{\dagger 1/2}$$

is diagonalizable (where $\dagger$ denotes the Moore-Penrose pseudoinverse); its non-zero eigenvalue /
eigenvector pairs $(\lambda_1, v_1), (\lambda_2, v_2), \dots, (\lambda_k, v_k)$ satisfy $\lambda_i = \eta^\top \mu_{\pi(i)}$ and $M_2^{1/2} v_i = s_i \sqrt{w_{\pi(i)}} \, \mu_{\pi(i)}$ for
some permutation $\pi$ on $[k]$ and signs $s_1, s_2, \dots, s_k \in \{\pm 1\}$. The $\mu_i$, $\sigma_i^2$, and $w_i$ are recovered (up
to permutation) with

$$\mu_{\pi(i)} = \frac{\lambda_i}{\eta^\top M_2^{1/2} v_i} M_2^{1/2} v_i, \qquad \sigma_i^2 = \frac{1}{w_i} e_i^\top A^\dagger M_1, \qquad w_i = e_i^\top A^\dagger E[x].$$

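Proof. By Theorem 1,

$$M_1 = A \, \mathrm{diag}(\sigma_1^2, \sigma_2^2, \dots, \sigma_k^2) \, w, \qquad M_2 = A \, \mathrm{diag}(w) A^\top, \qquad M_3(\eta) = A \, \mathrm{diag}(w) D_1(\eta) A^\top,$$

where $D_1(\eta) := \mathrm{diag}(\eta^\top \mu_1, \eta^\top \mu_2, \dots, \eta^\top \mu_k)$.

Let $U S R^\top$ be the thin SVD of $A \, \mathrm{diag}(w)^{1/2}$ ($U \in \mathbb{R}^{d \times k}$, $S \in \mathbb{R}^{k \times k}$, and $R \in \mathbb{R}^{k \times k}$), so $M_2 = U S^2 U^\top$
and $M_2^{\dagger 1/2} = U S^{-1} U^\top$ since $A \, \mathrm{diag}(w)^{1/2}$ has rank $k$ by assumption. Also by assumption,
the diagonal entries of $D_1(\eta)$ are distinct and non-zero. Therefore, every non-zero eigenvalue of
the symmetric matrix $M_{\mathrm{GMM}}(\eta) = U R^\top D_1(\eta) R U^\top$ has geometric multiplicity one. Indeed, these
non-zero eigenvalues $\lambda_i$ are the diagonal entries of $D_1(\eta)$ (up to some permutation $\pi$ on $[k]$), and
the corresponding eigenvectors $v_i$ are the columns of $U R^\top$ up to signs:

$$\lambda_i = \eta^\top \mu_{\pi(i)} \quad \text{and} \quad v_i = s_i U R^\top e_{\pi(i)}.$$

Now, since

$$M_2^{1/2} v_i = s_i \sqrt{w_{\pi(i)}} \, \mu_{\pi(i)} \quad \text{and} \quad \frac{\lambda_i}{\eta^\top M_2^{1/2} v_i} = \frac{\eta^\top \mu_{\pi(i)}}{s_i \sqrt{w_{\pi(i)}} \, \eta^\top \mu_{\pi(i)}} = \frac{1}{s_i \sqrt{w_{\pi(i)}}},$$

it follows that

$$\mu_{\pi(i)} = \frac{\lambda_i}{\eta^\top M_2^{1/2} v_i} M_2^{1/2} v_i, \quad i \in [k].$$

The claims regarding $\sigma_i^2$ and $w_i$ are also evident from the structure of $M_1$ and $E[x] = A w$. □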
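
The decoding step of Theorem 2 can be sketched with exact moments as follows. This is an illustration under stated assumptions rather than the paper's algorithm: the helper names `psd_power` and `decode_gmm`, the test parameters, and the choice of `eta` are hypothetical, and the moments are built directly from the known parameters so that only the algebra is exercised.

```python
import numpy as np

def psd_power(M, p, rel_tol=1e-10):
    """M^p on the range of a symmetric PSD matrix M (a pseudoinverse power when p < 0)."""
    vals, vecs = np.linalg.eigh(M)
    keep = vals > rel_tol * vals.max()
    return (vecs[:, keep] * vals[keep] ** p) @ vecs[:, keep].T

def decode_gmm(M1, M2, M3_eta, mean, eta, k):
    """Recover means, weights, and variances (up to permutation) from exact moments."""
    sqrt_M2 = psd_power(M2, 0.5)
    inv_sqrt_M2 = psd_power(M2, -0.5)
    M_gmm = inv_sqrt_M2 @ M3_eta @ inv_sqrt_M2
    lam, V = np.linalg.eigh((M_gmm + M_gmm.T) / 2)
    top = np.argsort(-np.abs(lam))[:k]           # the k non-zero eigenvalues
    cols = []
    for i in top:
        u = sqrt_M2 @ V[:, i]                    # equals s_i * sqrt(w) * mu
        cols.append(lam[i] / (eta @ u) * u)      # the unknown sign s_i cancels here
    A_hat = np.column_stack(cols)
    A_pinv = np.linalg.pinv(A_hat)
    w_hat = A_pinv @ mean                        # from E[x] = A w
    sigma2_hat = (A_pinv @ M1) / w_hat           # from M_1 = A diag(sigma^2) w
    return A_hat, w_hat, sigma2_hat

# Hypothetical parameters; eta is chosen so that eta^T mu_i are distinct and non-zero.
A = np.array([[4.0, 0.0, 0.0],
              [0.0, 4.0, 0.0],
              [0.0, 0.0, 4.0],
              [1.0, 1.0, 1.0]])
w = np.array([0.5, 0.3, 0.2])
sigma2 = np.array([1.0, 0.5, 0.25])
eta = np.array([1.0, 2.0, 3.0, 0.0]) / np.sqrt(14.0)

M1 = A @ (sigma2 * w)                            # A diag(sigma^2) w
M2 = (A * w) @ A.T                               # A diag(w) A^T
M3_eta = (A * (w * (eta @ A))) @ A.T             # A diag(w) D_1(eta) A^T
A_hat, w_hat, sigma2_hat = decode_gmm(M1, M2, M3_eta, A @ w, eta, k=3)
```

The columns of `A_hat` match the columns of `A` only up to a permutation, which is why comparisons should first align the components, for example by sorting on the recovered weights.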

An efficiently computable plug-in estimator can be derived from Theorem 2. We state one such
algorithm (called LearnGMM) in Appendix C; for simplicity, we restrict to the case where the
components share the same common spherical covariance, i.e., $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2 = \sigma^2$. The
following theorem provides a sample complexity bound for accurate estimation of the component
means. Since only low-order moments are used, the sample complexity is polynomial in the relevant
parameters of the estimation problem (in particular, the dimension $d$ and the number of mixing
components $k$). It is worth noting that the polynomial is quadratic in the inverse accuracy
parameter $1/\varepsilon$; this owes to the fact that the empirical moments converge to the population moments at
the usual $n^{-1/2}$ rate as per the central limit theorem.

Theorem 3 (Finite sample bound). There exists a polynomial $\mathrm{poly}(\cdot)$ such that the following holds.
Let $M_2$ be the matrix defined in Theorem 1, and $\varsigma_t[M_2]$ be its $t$-th largest singular value (for $t \in [k]$).
Let $b_{\max} := \max_{i \in [k]} \|\mu_i\|_2$ and $w_{\min} := \min_{i \in [k]} w_i$. Pick any $\varepsilon, \delta \in (0, 1)$. Suppose the sample size
$n$ satisfies

$$n \ge \mathrm{poly}\left( d, \; k, \; 1/\varepsilon, \; \log(1/\delta), \; 1/w_{\min}, \; \varsigma_1[M_2]/\varsigma_k[M_2], \; b_{\max}^2/\varsigma_k[M_2], \; \sigma^2/\varsigma_k[M_2] \right).$$

Then with probability at least $1 - \delta$ over the random sample and the internal randomness of the
algorithm, there exists a permutation $\pi$ on $[k]$ such that the $\{\hat\mu_i : i \in [k]\}$ returned by LearnGMM
satisfy

$$\|\hat\mu_{\pi(i)} - \mu_i\|_2 \le \left( \|\mu_i\|_2 + \sqrt{\varsigma_1[M_2]} \right) \varepsilon$$

for all $i \in [k]$.



The proof of Theorem 3 is given in Appendix C. It is also easy to obtain accuracy guarantees for
estimating $\sigma^2$ and $w$. The role of Condition 1 enters by observing that $\varsigma_k[M_2] = 0$ if either $\mathrm{rank}(A) < k$
or $w_{\min} = 0$, as $M_2 = A \, \mathrm{diag}(w) A^\top$. The sample complexity bound then becomes trivial in this
case, as the bound grows with $1/\varsigma_k[M_2]$ and $1/w_{\min}$. Finally, we also note that LearnGMM is just
one (easy to state) way to obtain an efficient algorithm based on the structure in Theorem 1. It is
also possible to use, for instance, simultaneous diagonalization techniques (Bunse-Gerstner et al.,
1993) or orthogonal tensor decompositions (Anandkumar et al., 2012a) to extract the parameters
from (estimates of) $M_2$ and $M_3$; these alternative methods are more robust to sampling error, and
are therefore recommended for practical implementation.



3 Discussion 

Multi-view methods and a simpler algorithm in higher dimensions. Some previous work
of the authors on moment-based estimators for the Gaussian mixture model relies on a
non-degenerate multi-view assumption (Anandkumar et al., 2012b). In that work, it is shown that
if each mixture component $i$ has an axis-aligned covariance $\Sigma_i := \mathrm{diag}(\sigma_{1,i}^2, \sigma_{2,i}^2, \dots, \sigma_{d,i}^2)$, then
under some additional mild assumptions (which ultimately require $d > k$), a moment-based method
can be used to estimate the model parameters. The idea is to partition the coordinates $[d]$ into
three groups, inducing multiple "views" $x = (x_1, x_2, x_3)$ with each $x_t \in \mathbb{R}^{d_t}$ for some $d_t \ge k$ such
that $x_1$, $x_2$, and $x_3$ are conditionally independent given $h$. When the matrix of conditional means
$A_t := [E[x_t | h = 1] \,|\, E[x_t | h = 2] \,|\, \cdots \,|\, E[x_t | h = k]] \in \mathbb{R}^{d_t \times k}$ for each view $t \in \{1, 2, 3\}$ has rank $k$,
then an efficient technique similar to that described in Theorem 2 will recover the parameters.
Therefore, the problem is reduced to partitioning the coordinates so that the resulting matrices $A_t$
have rank $k$.

In the case where each component covariance is spherical ($\Sigma_i = \sigma_i^2 I$), we may simply apply
a random rotation to $x$ before (arbitrarily) splitting into the three views. Let $\tilde x := \Theta x$ for a
random orthogonal matrix $\Theta \in \mathbb{R}^{d \times d}$, and partition the coordinates so that $\tilde x = (\tilde x_1, \tilde x_2, \tilde x_3)$ with
$\tilde x_t \in \mathbb{R}^{d_t}$ and $d_t \ge k$. By the rotational invariance of the multivariate Gaussian distribution, the
distribution of $\tilde x$ is still a mixture of spherical Gaussians, and moreover, the matrix of conditional
means $\tilde A_t := [E[\tilde x_t | h = 1] \,|\, E[\tilde x_t | h = 2] \,|\, \cdots \,|\, E[\tilde x_t | h = k]] \in \mathbb{R}^{d_t \times k}$ for each view $\tilde x_t$ has rank $k$ with
probability 1. To see this, observe that a random rotation in $\mathbb{R}^d$ followed by a restriction to $d_t$
coordinates is simply a random projection from $\mathbb{R}^d$ to $\mathbb{R}^{d_t}$, and that a random projection of a linear
subspace of dimension $k$ (in particular, the range of $A$) to $\mathbb{R}^{d_t}$ is almost surely injective as long as
$d_t \ge k$. Therefore it is sufficient to require $d \ge 3k$ so that it is possible to split $\tilde x$ into three views,
each of dimension $d_t \ge k$. To guarantee that the $k$-th largest singular value of each $\tilde A_t$ is bounded
below in terms of the $k$-th largest singular value of $A$ (with high probability), we may require $d$ to
be somewhat larger: $O(k \log k)$ certainly works (see Appendix B), and we conjecture $c \cdot k$ for some
$c > 3$ is in fact sufficient.



Spectral decomposition approaches for ICA. The Gaussian mixture model shares some
similarities with a standard model for independent component analysis (ICA) (Comon, 1994; Cardoso and Comon,
1996; Hyvarinen and Oja, 2000; Comon and Jutten, 2010). Here, let $h \in \mathbb{R}^k$ be a random vector
with independent entries, and let $z \in \mathbb{R}^k$ be a multivariate Gaussian random vector. We think of $h$
as an unobserved signal and $z$ as noise. The observed random vector is

$$x := A h + z$$

for some $A \in \mathbb{R}^{k \times k}$, where $h$ and $z$ are assumed to be independent. (For simplicity, we only consider
square $A$, although it is easy to generalize to $A \in \mathbb{R}^{d \times k}$ for $d \ge k$.)

In contrast to this ICA model, the spherical Gaussian mixture model is one where $h$ would take
values in $\{e_1, e_2, \dots, e_k\}$, and the covariance of $z$ (given $h$) is spherical.

For ICA, a spectral decomposition approach related to the one described in Theorem 2 can be
used to estimate the columns of $A$ (up to scale), without knowing the noise covariance $E[z z^\top]$. Such
an estimator can be obtained from Theorem 4 using techniques commonplace in the ICA literature;
its proof is given in Appendix A for completeness.



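Theorem 4. In the ICA model described above, assume $E[h_i] = 0$, $E[h_i^2] = 1$, and $\kappa_i := E[h_i^4] - 3 \ne 0$
(i.e., the excess kurtosis is non-zero), and that $A$ is non-singular. Write $\mu_i$ for the $i$-th column of $A$. Define $f \colon \mathbb{R}^k \to \mathbb{R}$ by

$$f(\eta) := 12^{-1} \left( m_4(\eta) - 3 m_2(\eta)^2 \right)$$

where $m_p(\eta) := E[(\eta^\top x)^p]$. Suppose $\phi \in \mathbb{R}^k$ and $\psi \in \mathbb{R}^k$ are such that

$$\frac{(\phi^\top \mu_1)^2}{(\psi^\top \mu_1)^2}, \; \frac{(\phi^\top \mu_2)^2}{(\psi^\top \mu_2)^2}, \; \dots, \; \frac{(\phi^\top \mu_k)^2}{(\psi^\top \mu_k)^2} \in \mathbb{R}$$

are distinct. Then the matrix

$$M_{\mathrm{ICA}}(\phi, \psi) := (\nabla^2 f(\phi)) (\nabla^2 f(\psi))^{-1}$$

is diagonalizable; the eigenvalues are $\frac{(\phi^\top \mu_1)^2}{(\psi^\top \mu_1)^2}, \frac{(\phi^\top \mu_2)^2}{(\psi^\top \mu_2)^2}, \dots, \frac{(\phi^\top \mu_k)^2}{(\psi^\top \mu_k)^2}$ and each have geometric
multiplicity one, and the corresponding eigenvectors are $\mu_1, \mu_2, \dots, \mu_k$ (up to scaling and permutation).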



Again, choosing $\phi$ and $\psi$ as random unit vectors ensures the distinctness assumption is satisfied
almost surely, and a finite sample analysis can be given using standard matrix perturbation
techniques (Anandkumar et al., 2012b). A number of related deterministic algorithms based on
algebraic techniques are discussed in the text of Comon and Jutten (2010). Recent work of Arora et al.
(2012) provides a finite sample complexity analysis for an efficient estimator based on local search.



Non-degeneracy. The non-degeneracy assumption (Condition 1) is quite natural, and it has
the virtue of permitting tractable and consistent estimators. Although previous work has typically
tied it with additional assumptions, this work shows that they are largely unnecessary.

One drawback of Condition 1 is that it prevents the straightforward application of these
techniques to certain problem domains (e.g., automatic speech recognition (ASR), where the number
of mixture components is typically enormous, but the dimension of observations is relatively small;
alternatively, the span of the means has dimension $< k$). To compensate, one may require
multiple views, which are granted by a number of models, including hidden Markov models used in
ASR (Hsu et al., 2012a; Anandkumar et al., 2012b), and combining these views in a tensor
product fashion (Allman et al., 2009). This increases the complexity of the estimator, but that may
be inevitable as estimation for certain non-singular models is conjectured to be computationally
intractable (Mossel and Roch, 2006).
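
A Connection to independent component analysis

Proof of Theorem 4. It can be shown that

$$m_2(\eta) = E[(\eta^\top A h)^2] + E[(\eta^\top z)^2], \qquad m_4(\eta) = E[(\eta^\top A h)^4] - 3 E[(\eta^\top A h)^2]^2 + 3 m_2(\eta)^2.$$

By the assumptions,

$$E[(\eta^\top A h)^4] = \sum_{i=1}^k (\eta^\top \mu_i)^4 E[h_i^4] + 3 \sum_{i \ne j} (\eta^\top \mu_i)^2 (\eta^\top \mu_j)^2$$

$$= \sum_{i=1}^k \kappa_i (\eta^\top \mu_i)^4 + 3 \sum_{i,j} (\eta^\top \mu_i)^2 (\eta^\top \mu_j)^2$$

$$= \sum_{i=1}^k \kappa_i (\eta^\top \mu_i)^4 + 3 E[(\eta^\top A h)^2]^2,$$

and therefore

$$f(\eta) = 12^{-1} \left( E[(\eta^\top A h)^4] - 3 E[(\eta^\top A h)^2]^2 \right) = 12^{-1} \sum_{i=1}^k \kappa_i (\eta^\top \mu_i)^4.$$

The Hessian of $f$ is given by

$$\nabla^2 f(\eta) = \sum_{i=1}^k \kappa_i (\eta^\top \mu_i)^2 \mu_i \mu_i^\top.$$

Define the diagonal matrices

$$K := \mathrm{diag}(\kappa_1, \kappa_2, \dots, \kappa_k), \qquad D_2(\eta) := \mathrm{diag}((\eta^\top \mu_1)^2, (\eta^\top \mu_2)^2, \dots, (\eta^\top \mu_k)^2)$$

and observe that

$$\nabla^2 f(\eta) = A K D_2(\eta) A^\top.$$

By assumption, the diagonal entries of $D_2(\phi) D_2(\psi)^{-1}$ are distinct, and therefore

$$M_{\mathrm{ICA}}(\phi, \psi) = (\nabla^2 f(\phi)) (\nabla^2 f(\psi))^{-1} = A D_2(\phi) D_2(\psi)^{-1} A^{-1}$$

is diagonalizable, and every eigenvalue has geometric multiplicity one. □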
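
The final identity of the proof can be checked numerically with the exact Hessian formula. In the sketch below, the mixing matrix `A`, the excess kurtoses `kappa`, and the directions `phi` and `psi` are hypothetical values chosen so that the distinctness assumption holds; `hessian_f` is an illustrative helper, not a routine from the paper.

```python
import numpy as np

def hessian_f(A, kappa, eta):
    """Exact Hessian A K D_2(eta) A^T, with D_2(eta) = diag((eta^T mu_i)^2)."""
    return (A * (kappa * (eta @ A) ** 2)) @ A.T

# Hypothetical non-singular mixing matrix and non-zero excess kurtoses.
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])
kappa = np.array([-1.2, 3.0, 0.5])
phi = np.array([1.0, 0.0, 2.0])
psi = np.array([1.0, 1.0, 1.0])

M_ica = hessian_f(A, kappa, phi) @ np.linalg.inv(hessian_f(A, kappa, psi))
vals, vecs = np.linalg.eig(M_ica)        # M_ica is not symmetric in general
vals, vecs = np.real(vals), np.real(vecs)

# Predicted eigenvalues (phi^T mu_i)^2 / (psi^T mu_i)^2.
ratios = (phi @ A) ** 2 / (psi @ A) ** 2
```

Each eigenvector returned in `vecs` should be parallel to one column of `A`; the scale and the ordering are not identified, in line with the statement of Theorem 4.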

B Incoherence and random rotations

The multi-view technique of Anandkumar et al. (2012b) can be used to estimate mixtures of product
distributions, which include, as special cases, mixtures of Gaussians with axis-aligned covariances
$\Sigma_i = \mathrm{diag}(\sigma_{1,i}^2, \sigma_{2,i}^2, \dots, \sigma_{d,i}^2)$. Spherical covariances $\Sigma_i = \sigma_i^2 I$ are, of course, also axis-aligned. The
idea is to randomly partition the coordinates $[d]$ into three groups, inducing multiple "views"
$x = (x_1, x_2, x_3)$ with each $x_t \in \mathbb{R}^{d_t}$ for some $d_t \ge k$ such that $x_1$, $x_2$, and $x_3$ are conditionally independent
given $h$. When the matrix of conditional means $A_t := [E[x_t | h = 1] \,|\, E[x_t | h = 2] \,|\, \cdots \,|\, E[x_t | h = k]] \in \mathbb{R}^{d_t \times k}$
for each view $t \in \{1, 2, 3\}$ has rank $k$, then an efficient technique similar to that described in
Theorem 2 will recover the parameters (for details, see Anandkumar et al. (2012b)).

Anandkumar et al. (2012b) show that if $A$ has rank $k$ and also satisfies a mild incoherence
condition, then a random partitioning guarantees that each $A_t$ has rank $k$, and lower-bounds the
$k$-th largest singular value of each $A_t$ by that of $A$. The condition is similar to the spreading
condition of Chaudhuri and Rao (2008).

Define $\mathrm{coherence}(A) := \max_{i \in [d]} \{ e_i^\top \Pi_A e_i \}$ to be the largest diagonal entry of the ortho-projector
$\Pi_A$ to the range of $A$. When $A$ has rank $k$, we have $\mathrm{coherence}(A) \in [k/d, 1]$; it is maximized when
$\mathrm{range}(A) = \mathrm{span}\{e_1, e_2, \dots, e_k\}$ and minimized when the range is spanned by a subset of the
Hadamard basis of cardinality $k$. Roughly speaking, if the matrix of conditional means has low
coherence, then its full-rank property is witnessed by many partitions of $[d]$; this is made formal in
the following lemma.

Lemma 1. Assume $A$ has rank $k$ and that $\mathrm{coherence}(A) \le (\varepsilon^2/6) / \ln(3k/\delta)$ for some $\varepsilon, \delta \in (0, 1)$.
With probability at least $1 - \delta$, a random partitioning of the dimensions $[d]$ into three groups (for
each $i \in [d]$, independently pick $t \in \{1, 2, 3\}$ uniformly at random and put $i$ in group $t$) has the
following property. For each $t \in \{1, 2, 3\}$, the matrix $A_t$ obtained by selecting the rows of $A$ in group
$t$ has full column rank, and the $k$-th largest singular value of $A_t$ is at least $\sqrt{(1 - \varepsilon)/3}$ times that
of $A$.

For a mixture of spherical Gaussians, one can randomly rotate $x$ before applying the random
coordinate partitioning. This is because if $\Theta \in \mathbb{R}^{d \times d}$ is an orthogonal matrix, then the distribution
of $\tilde x := \Theta x$ is also a mixture of spherical Gaussians. Its matrix of conditional means is given by
$\tilde A := \Theta A$. The following lemma implies that multiplying a tall matrix $A$ by a random rotation $\Theta$
causes the product to have low coherence.

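Lemma 2 (Hsu et al., 2011). Let $A \in \mathbb{R}^{d \times k}$ be a fixed matrix with rank $k$, and let $\Theta \in \mathbb{R}^{d \times d}$
be chosen uniformly at random among all orthogonal $d \times d$ matrices. For any $\eta \in (0, 1)$, with
probability at least $1 - \eta$, the matrix $\tilde A := \Theta A$ satisfies

$$\mathrm{coherence}(\tilde A) \le \frac{k + 2\sqrt{k \ln(d/\eta)} + 2 \ln(d/\eta)}{d \left( 1 - 1/(4d) - 1/(360 d^3) \right)^2}.$$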

Take $\eta$ from Lemma 2 and $\varepsilon, \delta$ from Lemma 1 to be constants. Then the incoherence condition
of Lemma 1 is satisfied provided that $d \ge c \cdot (k \log k)$ for some positive constant $c$.
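
The coherence quantity and the effect of a random rotation can be illustrated as follows. This is a sketch under assumptions: the helper names `coherence` and `random_rotation`, the dimensions, and the example matrices are hypothetical, and the Haar-distributed rotation is drawn with the common QR-with-sign-correction construction rather than any routine specified in the paper.

```python
import numpy as np

def coherence(A):
    """Largest diagonal entry of the orthogonal projector onto range(A) (A has full column rank)."""
    Q, _ = np.linalg.qr(A)                   # d x k orthonormal basis of range(A)
    return np.max(np.sum(Q ** 2, axis=1))    # diagonal of Q Q^T

def random_rotation(d, rng):
    """Draw a uniformly random d x d orthogonal matrix via QR of a Gaussian matrix."""
    G = rng.standard_normal((d, d))
    Q, R = np.linalg.qr(G)
    return Q * np.sign(np.diag(R))           # sign fix so the distribution is uniform

d, k = 64, 3
rng = np.random.default_rng(0)

# Worst case: range(A) = span{e_1, e_2, e_3}, so the coherence equals 1.
A = np.eye(d)[:, :k] * np.array([1.0, 2.0, 3.0])
Theta = random_rotation(d, rng)
A_rot = Theta @ A                            # conditional means after rotating x

# Best case in R^4 with k = 2: two columns of a 4 x 4 Hadamard matrix give k/d = 0.5.
H = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]])

coh_before, coh_after, coh_hadamard = coherence(A), coherence(A_rot), coherence(H)
```

In this example `coh_before` is at the upper end of the range $[k/d, 1]$, while the rotated matrix typically has coherence much closer to the lower end, which is the behavior Lemma 2 quantifies.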

C Learning algorithm and finite sample analysis

In this section, we state and analyze a learning algorithm based on the estimator from Theorem 2,
which assumed availability of exact moments of $x$. The proposed algorithm only uses a finite sample
to estimate moments, and also explicitly deals with the eigenvalue separation condition assumed
in Theorem 2 via internal randomization.

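C.1 Notation

For a matrix $X \in \mathbb{R}^{m \times m}$, we use $\varsigma_t[X]$ to denote the $t$-th largest singular value of $X$, and
$\|X\|_2$ to denote its spectral norm (so $\|X\|_2 = \varsigma_1[X]$).

For a third-order tensor $Y \in \mathbb{R}^{m \times m \times m}$ and $U, V, W \in \mathbb{R}^{m \times n}$, we use the notation $Y[U, V, W] \in
\mathbb{R}^{n \times n \times n}$ to denote the third-order tensor given by

$$Y[U, V, W]_{j_1, j_2, j_3} = \sum_{i_1, i_2, i_3} U_{i_1, j_1} V_{i_2, j_2} W_{i_3, j_3} Y_{i_1, i_2, i_3}.$$

Note that this is the analogue of $U^\top X V \in \mathbb{R}^{n \times n}$ for a matrix $X \in \mathbb{R}^{m \times m}$ and $U, V \in \mathbb{R}^{m \times n}$. For
$Y \in \mathbb{R}^{m \times m \times m}$, we use $\|Y\|_2$ to denote its operator (or supremum) norm $\|Y\|_2 := \sup\{ |Y[u, v, w]| :
u, v, w \in \mathbb{R}^m, \|u\|_2 = \|v\|_2 = \|w\|_2 = 1 \}$.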
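
The multilinear notation maps directly onto an index contraction. The following sketch expresses $Y[U, V, W]$ and the matrix $T(\eta)$ from Section 2 with `numpy.einsum`; the function names `multilinear` and `contract_last` and the random example are illustrative assumptions.

```python
import numpy as np

def multilinear(Y, U, V, W):
    """Y[U, V, W]: contract mode 1 of Y with U, mode 2 with V, and mode 3 with W."""
    return np.einsum("abc,ai,bj,cl->ijl", Y, U, V, W)

def contract_last(T, eta):
    """T(eta) = sum over i3 of eta[i3] * T[:, :, i3], a d x d matrix."""
    return np.einsum("abc,c->ab", T, eta)

# Small example with a rank-one tensor a (x) b (x) c.
rng = np.random.default_rng(0)
a, b, c = rng.standard_normal((3, 4))
Y = np.einsum("a,b,c->abc", a, b, c)
U, V, W = rng.standard_normal((3, 4, 2))
Z = multilinear(Y, U, V, W)                  # equals (U^T a) (x) (V^T b) (x) (W^T c)
```

For a rank-one tensor the contraction acts on each factor separately, which is the property the whitening step of LearnGMM relies on.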

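C.2 Algorithm

The proposed algorithm, called LearnGMM, is described in Figure 1. The algorithm essentially
implements the decomposition strategy in Theorem 2 using plug-in moments. To simplify the
analysis, we split our sample (say, initially of size $2n$) in two: we use the first half for empirical
moments ($\hat\mu$ and $\hat{\mathcal{M}}_2$) used in constructing $\hat\sigma^2$, $\hat M_2$, $\hat W$, and $\hat B$; and we use the second half for
empirical moments ($\hat W^\top \underline{\hat\mu}$ and $\underline{\hat{\mathcal{M}}}_3[\hat W, \hat W, \hat W]$) used in constructing $\hat M_3[\hat W, \hat W, \hat W]$. Observe that
this ensures $\underline{\hat{\mathcal{M}}}_3$ is independent of $\hat W$.

Let $\{(x_i, h_i) : i \in [n]\}$ be $n$ i.i.d. copies of $(x, h)$, and write $S := \{x_1, x_2, \dots, x_n\}$. Let $\underline{S}$ be an
independent copy of $S$. Furthermore, define the following moments and empirical moments:

$$\mu := E[x], \qquad \mathcal{M}_2 := E[x x^\top], \qquad \mathcal{M}_3 := E[x \otimes x \otimes x],$$

$$\hat\mu := \frac{1}{|S|} \sum_{x \in S} x, \qquad \hat{\mathcal{M}}_2 := \frac{1}{|S|} \sum_{x \in S} x x^\top, \qquad \underline{\hat\mu} := \frac{1}{|\underline{S}|} \sum_{x \in \underline{S}} x, \qquad \underline{\hat{\mathcal{M}}}_3 := \frac{1}{|\underline{S}|} \sum_{x \in \underline{S}} x \otimes x \otimes x.$$

So $S$ represents the first half of the sample, and $\underline{S}$ represents the second half of the sample.
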
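C.3 Structure of the moments

We first recall the basic structure of the moments $\mu$, $\mathcal{M}_2$, and $\mathcal{M}_3$ as established in Theorem 2; for
simplicity, we restrict to the special case where $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2 = \sigma^2$.

Lemma 3 (Structure of moments).

$$\mu = \sum_{i=1}^k w_i \mu_i,$$

$$\mathcal{M}_2 = \sum_{i=1}^k w_i \mu_i \mu_i^\top + \sigma^2 I,$$

$$\mathcal{M}_3 = \sum_{i=1}^k w_i \, \mu_i \otimes \mu_i \otimes \mu_i + \sigma^2 \sum_{j=1}^d \left( \mu \otimes e_j \otimes e_j + e_j \otimes \mu \otimes e_j + e_j \otimes e_j \otimes \mu \right).$$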
i=i j=i 
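
LearnGMM

1. Using the first half of the sample, compute empirical mean $\hat\mu$ and empirical second-order
moments $\hat{\mathcal{M}}_2$.

2. Let $\hat\sigma^2$ be the $k$-th largest eigenvalue of the empirical covariance matrix $\hat{\mathcal{M}}_2 - \hat\mu \hat\mu^\top$.

3. Let $\hat M_2$ be the best rank-$k$ approximation to $\hat{\mathcal{M}}_2 - \hat\sigma^2 I$,

$$\hat M_2 := \arg\min_{X \in \mathbb{R}^{d \times d} : \mathrm{rank}(X) \le k} \| (\hat{\mathcal{M}}_2 - \hat\sigma^2 I) - X \|_2,$$

which can be obtained via the singular value decomposition.

4. Let $\hat U \in \mathbb{R}^{d \times k}$ be the matrix of left orthonormal singular vectors of $\hat M_2$.

5. Let $\hat W := \hat U (\hat U^\top \hat M_2 \hat U)^{\dagger 1/2}$, where $X^\dagger$ denotes the Moore-Penrose pseudoinverse of a
matrix $X$.

Also define $\hat B := \hat U (\hat U^\top \hat M_2 \hat U)^{1/2}$.

6. Using the second half of the sample, compute whitened empirical averages $\hat W^\top \underline{\hat\mu}$ and
third-order moments $\underline{\hat{\mathcal{M}}}_3[\hat W, \hat W, \hat W]$.

7. Let $\hat M_3[\hat W, \hat W, \hat W] := \underline{\hat{\mathcal{M}}}_3[\hat W, \hat W, \hat W] - \hat\sigma^2 \sum_{i=1}^d \big( (\hat W^\top \underline{\hat\mu}) \otimes (\hat W^\top e_i) \otimes (\hat W^\top e_i) + (\hat W^\top e_i) \otimes
(\hat W^\top \underline{\hat\mu}) \otimes (\hat W^\top e_i) + (\hat W^\top e_i) \otimes (\hat W^\top e_i) \otimes (\hat W^\top \underline{\hat\mu}) \big)$.

8. Repeat the following steps $t$ times (where $t := \lceil \log_2(1/\delta) \rceil$ for confidence $1 - \delta$):

(a) Choose $\theta \in \mathbb{R}^k$ uniformly at random from the unit sphere in $\mathbb{R}^k$.

(b) Let $\{(\hat v_i, \hat\lambda_i) : i \in [k]\}$ be the eigenvector/eigenvalue pairs of $\hat M_3[\hat W, \hat W, \hat W \theta]$.

Retain the results for which $\min(\{ |\hat\lambda_i - \hat\lambda_j| : i \ne j \} \cup \{ |\hat\lambda_i| : i \in [k] \})$ is largest.

9. Return the parameter estimates $\hat\sigma^2$,

$$\hat\mu_i := \frac{\hat\lambda_i}{\theta^\top \hat v_i} \hat B \hat v_i, \quad i \in [k],$$

$$\hat w := [\hat\mu_1 | \hat\mu_2 | \cdots | \hat\mu_k]^\dagger \hat\mu.$$

Figure 1: Algorithm for learning mixtures of Gaussians with common spherical covariance.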
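
The steps of Figure 1 can be sketched in NumPy as below. This is an illustrative reading of the figure, not a reference implementation: the function name `learn_gmm`, the `trials` parameter (standing in for $t$), and the synthetic data are assumptions. The sketch also assumes $k \ge 2$ and that the top $k$ eigenvalues of the shifted second-moment matrix are positive, and it forms the contracted matrix $\hat M_3[\hat W, \hat W, \hat W\theta]$ directly instead of materializing the full third-order tensor.

```python
import numpy as np

def learn_gmm(X1, X2, k, trials=10, rng=None):
    """Plug-in sketch of LearnGMM; X1 and X2 are the two sample halves (one row per point)."""
    rng = np.random.default_rng() if rng is None else rng
    d = X1.shape[1]
    # Steps 1-2: first-half mean, second moments, and variance estimate.
    mu1 = X1.mean(axis=0)
    M2_raw = X1.T @ X1 / len(X1)
    cov_vals = np.linalg.eigvalsh(M2_raw - np.outer(mu1, mu1))   # ascending order
    sigma2 = cov_vals[d - k]                                     # k-th largest eigenvalue
    # Steps 3-5: rank-k part of the shifted moments, whitening W and un-whitening B.
    vals, vecs = np.linalg.eigh(M2_raw - sigma2 * np.eye(d))
    U, s = vecs[:, -k:], vals[-k:]                               # assumes s > 0
    W = U / np.sqrt(s)
    B = U * np.sqrt(s)
    # Step 6: whitened second-half data (rows are W^T x) and whitened mean.
    Y = X2 @ W
    a = Y.mean(axis=0)
    WtW = W.T @ W
    best = None
    for _ in range(trials):                                      # Step 8
        theta = rng.standard_normal(k)
        theta /= np.linalg.norm(theta)
        # Step 7, contracted with theta in the third mode.
        T = (Y * (Y @ theta)[:, None]).T @ Y / len(Y)
        g = WtW @ theta
        T -= sigma2 * (np.outer(a, g) + np.outer(g, a) + (a @ theta) * WtW)
        lam, V = np.linalg.eigh((T + T.T) / 2)
        gaps = np.abs(lam[:, None] - lam[None, :])[~np.eye(k, dtype=bool)]
        score = min(gaps.min(), np.abs(lam).min())
        if best is None or score > best[0]:
            best = (score, theta, lam, V)
    _, theta, lam, V = best
    # Step 9: un-whiten the eigenvectors and solve for the weights.
    means = (B @ V) * (lam / (theta @ V))
    weights = np.linalg.pinv(means) @ mu1
    return means, weights, sigma2

# Synthetic check with hypothetical, well-separated parameters.
rng = np.random.default_rng(0)
d, k, n = 6, 3, 100_000
A = np.zeros((d, k))
A[np.arange(k), np.arange(k)] = 4.0
w = np.array([0.5, 0.3, 0.2])
sigma = 0.5

def draw(n):
    h = rng.choice(k, size=n, p=w)
    return A[:, h].T + sigma * rng.standard_normal((n, d))

means, weights, sigma2 = learn_gmm(draw(n), draw(n), k, rng=rng)
```

The returned columns of `means` are in arbitrary order. As noted in Section 2, simultaneous diagonalization or orthogonal tensor decompositions are described as more robust alternatives for practical use.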

C.4 Concentration behavior of empirical quantities

In this subsection, we prove concentration properties of empirical quantities based on $S$; clearly
the same properties hold for $\underline{S}$.

Let $S_i := \{x_j \in S : h_j = i\}$ and $\hat w_i := |S_i| / |S|$ for $i \in [k]$. Also, define the following (empirical)
conditional moments:

$$\mu_i := E[x | h = i], \qquad \mathcal{M}_{2,i} := E[x x^\top | h = i], \qquad \mathcal{M}_{3,i} := E[x \otimes x \otimes x | h = i],$$

$$\hat\mu_i := \frac{1}{|S_i|} \sum_{x \in S_i} x, \qquad \hat{\mathcal{M}}_{2,i} := \frac{1}{|S_i|} \sum_{x \in S_i} x x^\top, \qquad \hat{\mathcal{M}}_{3,i} := \frac{1}{|S_i|} \sum_{x \in S_i} x \otimes x \otimes x.$$

Lemma 4 (Concentration of proportions). Pick any $\delta \in (0, 1/2)$. With probability at least $1 - 2\delta$,

$$|\hat w_i - w_i| \le \sqrt{\frac{2 w_i (1 - w_i) \ln(2k/\delta)}{n}} + \frac{2 \ln(2k/\delta)}{3n}, \quad \forall i \in [k];$$

$$\left( \sum_{i=1}^k (\hat w_i - w_i)^2 \right)^{1/2} \le \frac{1 + \sqrt{\ln(1/\delta)}}{\sqrt{n}}.$$

Proof. The first inequality follows from Bernstein's inequality and a union bound. The second
inequality follows from a simple application of McDiarmid's inequality (see Hsu et al., 2012a,
Proposition 19). □

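Lemma 5 (Concentration of per-component empirical moments). Pick any $\delta \in (0, 1)$ and any
matrix $R \in \mathbb{R}^{d \times r}$ of rank $r$.

1. First-order moments: with probability at least $1 - \delta$,

$$\| R^\top (\hat\mu_i - \mu_i) \|_2 \le \sigma \|R\|_2 \sqrt{\frac{r + 2\sqrt{r \ln(k/\delta)} + 2 \ln(k/\delta)}{\hat w_i n}}, \quad \forall i \in [k].$$

2. Second-order moments: with probability at least $1 - \delta$,

$$\| R^\top (\hat{\mathcal{M}}_{2,i} - \mathcal{M}_{2,i}) R \|_2 \le \sigma^2 \|R\|_2^2 \left( \sqrt{\frac{128 (r \ln 9 + \ln(2k/\delta))}{\hat w_i n}} + \frac{4 (r \ln 9 + \ln(2k/\delta))}{\hat w_i n} \right)$$

$$+ 2 \sigma \| R^\top \mu_i \|_2 \|R\|_2 \sqrt{\frac{r + 2\sqrt{r \ln(2k/\delta)} + 2 \ln(2k/\delta)}{\hat w_i n}}, \quad \forall i \in [k].$$

3. Third-order moments: with probability at least $1 - \delta$,

$$\| (\hat{\mathcal{M}}_{3,i} - \mathcal{M}_{3,i})[R, R, R] \|_2 \le \sigma^3 \|R\|_2^3 \sqrt{\frac{108 e^3 \lceil r \ln 13 + \ln(3k/\delta) \rceil^3}{\hat w_i n}}$$

$$+ 3 \sigma^2 \| R^\top \mu_i \|_2 \|R\|_2^2 \left( \sqrt{\frac{128 (r \ln 9 + \ln(3k/\delta))}{\hat w_i n}} + \frac{4 (r \ln 9 + \ln(3k/\delta))}{\hat w_i n} \right)$$

$$+ 3 \sigma \| R^\top \mu_i \|_2^2 \|R\|_2 \sqrt{\frac{r + 2\sqrt{r \ln(3k/\delta)} + 2 \ln(3k/\delta)}{\hat w_i n}}, \quad \forall i \in [k].$$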
V Win 
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Proof. We separately consider first-, second-, and third-order moments. Throughout, we let the thin 
SVD of R be given by R = USV T , where U 6 M dxr has orthonormal columns, and ||V<S||2 = ||-R||2- 
First-order moments. Observe that (win/a 2 )\\U T (fii — Mi)||i is distributed as the sum of r indepen- 
dent x 2 random variables, each with one degree of freedom. Thus, Lemma [18] and union bounds 
imply 



Pr 



3i € [k] . \\U 



, 2 f r + 2^rln(k/5)+2ln(k/5) \ 

m - Hi) 2 > ° : 

V Win J 



< 8. 



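Second-order moments. Since $\mathcal{M}_{2,i} = \sigma^2 I + \mu_i \mu_i^\top$, it follows by the triangle and Cauchy-Schwarz inequalities that

$$\|R^\top(\widehat{\mathcal{M}}_{2,i} - \mathcal{M}_{2,i})R\|_2 \le \|R\|_2^2 \left\| \frac{1}{\hat w_i n} \sum_{j \in [n] : x_j \in S_i} \left( U^\top (x_j - \mu_i)(x_j - \mu_i)^\top U - \sigma^2 I \right) \right\|_2 + 2\|R^\top \mu_i\|_2 \|R^\top(\hat\mu_i - \mu_i)\|_2.$$

A tail bound for the first term follows from Lemma 19 combined with a union bound:

$$\Pr\left[ \exists i \in [k] \,.\, \left\| \frac{1}{\hat w_i n} \sum_{j \in [n] : x_j \in S_i} \left( U^\top (x_j - \mu_i)(x_j - \mu_i)^\top U - \sigma^2 I \right) \right\|_2 > \sigma^2 \left( \sqrt{\frac{128(r \ln 9 + \ln(k/\delta))}{\hat w_i n}} + \frac{4(r \ln 9 + \ln(k/\delta))}{\hat w_i n} \right) \right] \le \delta.$$

The second term is handled as above.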
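Third-order moments. It can be checked that

$$\mathcal{M}_{3,i} = \mu_i \otimes \mu_i \otimes \mu_i + \sigma^2 \sum_{t=1}^d \left( \mu_i \otimes e_t \otimes e_t + e_t \otimes \mu_i \otimes e_t + e_t \otimes e_t \otimes \mu_i \right)$$

(similar to Lemma 3) and

$$\begin{aligned}
\widehat{\mathcal{M}}_{3,i} - \mathcal{M}_{3,i} = \frac{1}{\hat w_i n} \Bigg( & \sum_{j \in [n] : x_j \in S_i} (x_j - \mu_i) \otimes (x_j - \mu_i) \otimes (x_j - \mu_i) \\
& + \sum_{j \in [n] : x_j \in S_i} \Big( \mu_i \otimes (x_j - \mu_i) \otimes (x_j - \mu_i) - \sigma^2 \sum_{t=1}^d \mu_i \otimes e_t \otimes e_t \Big) \\
& + \sum_{j \in [n] : x_j \in S_i} \Big( (x_j - \mu_i) \otimes \mu_i \otimes (x_j - \mu_i) - \sigma^2 \sum_{t=1}^d e_t \otimes \mu_i \otimes e_t \Big) \\
& + \sum_{j \in [n] : x_j \in S_i} \Big( (x_j - \mu_i) \otimes (x_j - \mu_i) \otimes \mu_i - \sigma^2 \sum_{t=1}^d e_t \otimes e_t \otimes \mu_i \Big) \\
& + \sum_{j \in [n] : x_j \in S_i} \mu_i \otimes \mu_i \otimes (x_j - \mu_i) + \sum_{j \in [n] : x_j \in S_i} \mu_i \otimes (x_j - \mu_i) \otimes \mu_i + \sum_{j \in [n] : x_j \in S_i} (x_j - \mu_i) \otimes \mu_i \otimes \mu_i \Bigg).
\end{aligned}$$

Therefore, by the triangle and Cauchy-Schwarz inequalities,

$$\begin{aligned}
\|(\widehat{\mathcal{M}}_{3,i} - \mathcal{M}_{3,i})[R,R,R]\|_2 \le{} & \|R\|_2^3 \left\| \frac{1}{\hat w_i n} \sum_{j \in [n] : x_j \in S_i} U^\top(x_j - \mu_i) \otimes U^\top(x_j - \mu_i) \otimes U^\top(x_j - \mu_i) \right\|_2 \\
& + 3\|R^\top \mu_i\|_2 \left\| \frac{1}{\hat w_i n} \sum_{j \in [n] : x_j \in S_i} R^\top \left( (x_j - \mu_i)(x_j - \mu_i)^\top - \sigma^2 I \right) R \right\|_2 \\
& + 3\|R^\top \mu_i\|_2^2 \|R^\top(\hat\mu_i - \mu_i)\|_2.
\end{aligned}$$

A tail bound for the first term is given by Lemma 21 combined with a union bound:

$$\Pr\left[ \exists i \in [k] \,.\, \left\| \frac{1}{\hat w_i n} \sum_{j \in [n] : x_j \in S_i} U^\top(x_j - \mu_i) \otimes U^\top(x_j - \mu_i) \otimes U^\top(x_j - \mu_i) \right\|_2 > \sigma^3 \sqrt{\frac{108 e^3 \lceil r \ln 13 + \ln(k/\delta) \rceil^3}{\hat w_i n}} \right] \le \delta.$$

The other terms are handled as per above. □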
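The numerical constants in Lemma 5 (128, 4 and ln 9 for second-order moments; 108 and ln 13 for third-order moments) are consistent with instantiating the cover parameter in Lemma 19 at 1/4 and in Lemma 21 at 1/6. These specific choices are an inference from the constants, not something the text states. The following minimal sketch checks the arithmetic exactly; the function names are hypothetical.

```python
from fractions import Fraction

def lemma19_constants(eps0):
    """Constants of the Lemma 19 bound after squaring the 1/(1 - 2*eps0) factor."""
    scale = 1 / (1 - 2 * eps0)
    return {
        "sqrt_coeff": 32 * scale ** 2,   # constant under the square root
        "linear_coeff": 2 * scale,       # constant on the linear term
        "cover_base": 1 + 2 / eps0,      # base of the covering number (1 + 2/eps0)^p
    }

def lemma21_constants(eps0):
    """Constants of the Lemma 21 bound after squaring the 1/(1 - 3*eps0) factor."""
    scale = 1 / (1 - 3 * eps0)
    return {
        "sqrt_coeff": 27 * scale ** 2,   # multiplies e^3 * ceil(...)^3 / m
        "cover_base": 1 + 2 / eps0,
    }

# Assumed choices of the cover parameter (inferred, not stated in the text).
second_order = lemma19_constants(Fraction(1, 4))
third_order = lemma21_constants(Fraction(1, 6))
print(second_order)
print(third_order)
```

Exact rational arithmetic is used so that the comparison with the integers appearing in Lemma 5 is not affected by floating-point rounding.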



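Lemma 6 (Accuracy of empirical moments). Fix a matrix $R \in \mathbb{R}^{d \times r}$. Define $B_{1,R} := \max_{i \in [k]} \|R^\top \mu_i\|_2$, $B_{2,R} := \max_{i \in [k]} \|R^\top \mathcal{M}_{2,i} R\|_2$, $B_{3,R} := \max_{i \in [k]} \|\mathcal{M}_{3,i}[R,R,R]\|_2$, $\mathcal{E}_{1,R} := \max_{i \in [k]} \|R^\top(\hat\mu_i - \mu_i)\|_2$, $\mathcal{E}_{2,R} := \max_{i \in [k]} \|R^\top(\widehat{\mathcal{M}}_{2,i} - \mathcal{M}_{2,i})R\|_2$, $\mathcal{E}_{3,R} := \max_{i \in [k]} \|\widehat{\mathcal{M}}_{3,i}[R,R,R] - \mathcal{M}_{3,i}[R,R,R]\|_2$, and $\mathcal{E}_w := (\sum_{i=1}^k (\hat w_i - w_i)^2)^{1/2}$. Then

$$\|R^\top(\hat\mu - \mu)\|_2 \le (1 + \sqrt{k}\,\mathcal{E}_w)\mathcal{E}_{1,R} + \sqrt{k}\,B_{1,R}\,\mathcal{E}_w;$$
$$\|R^\top(\widehat{\mathcal{M}}_2 - \mathcal{M}_2)R\|_2 \le (1 + \sqrt{k}\,\mathcal{E}_w)\mathcal{E}_{2,R} + \sqrt{k}\,B_{2,R}\,\mathcal{E}_w;$$
$$\|(\widehat{\mathcal{M}}_3 - \mathcal{M}_3)[R,R,R]\|_2 \le (1 + \sqrt{k}\,\mathcal{E}_w)\mathcal{E}_{3,R} + \sqrt{k}\,B_{3,R}\,\mathcal{E}_w.$$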

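Proof. We just show the third claimed inequality, as the others are similar. Write as shorthand $\hat T_i := \widehat{\mathcal{M}}_{3,i}[R,R,R]$ and $T_i := \mathcal{M}_{3,i}[R,R,R]$. Then

$$\begin{aligned}
\left\| \sum_{i=1}^k \hat w_i \hat T_i - \sum_{i=1}^k w_i T_i \right\|_2
&= \left\| \sum_{i=1}^k w_i (\hat T_i - T_i) + \sum_{i=1}^k (\hat w_i - w_i) T_i + \sum_{i=1}^k (\hat w_i - w_i)(\hat T_i - T_i) \right\|_2 \\
&\le \sum_{i=1}^k w_i \|\hat T_i - T_i\|_2 + \sum_{i=1}^k |\hat w_i - w_i| \|T_i\|_2 + \sum_{i=1}^k |\hat w_i - w_i| \|\hat T_i - T_i\|_2 \\
&\le \max_{i \in [k]} \|\hat T_i - T_i\|_2 + \sqrt{k} \|\hat w - w\|_2 \max_{i \in [k]} \|T_i\|_2 + \sqrt{k} \|\hat w - w\|_2 \max_{i \in [k]} \|\hat T_i - T_i\|_2 \\
&= \mathcal{E}_{3,R} + \sqrt{k}\,\mathcal{E}_w B_{3,R} + \sqrt{k}\,\mathcal{E}_w \mathcal{E}_{3,R}
\end{aligned}$$

where the first and second steps use the triangle inequality, and the second step uses Hölder's inequality. □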
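The decomposition in the proof of Lemma 6 can be checked numerically in the simplest possible setting, where each per-component moment is a scalar (so every norm is an absolute value). The sketch below is only an illustration under that assumption; the function name and the sample values are hypothetical.

```python
import math

def mixture_error_bound(w, w_hat, T, T_hat):
    """Return (lhs, rhs) of the Lemma 6 bound for scalar per-component moments.

    lhs = |sum_i w_hat_i * T_hat_i - sum_i w_i * T_i|
    rhs = (1 + sqrt(k) * E_w) * E + sqrt(k) * B * E_w
    Assumes w is a probability vector (non-negative, sums to one).
    """
    k = len(w)
    lhs = abs(sum(a * b for a, b in zip(w_hat, T_hat))
              - sum(a * b for a, b in zip(w, T)))
    e_w = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_hat, w)))
    e_max = max(abs(a - b) for a, b in zip(T_hat, T))   # plays the role of E_{3,R}
    b_max = max(abs(t) for t in T)                      # plays the role of B_{3,R}
    rhs = (1 + math.sqrt(k) * e_w) * e_max + math.sqrt(k) * b_max * e_w
    return lhs, rhs

# Illustrative values only.
lhs, rhs = mixture_error_bound(
    w=[0.2, 0.3, 0.5], w_hat=[0.25, 0.27, 0.48],
    T=[1.0, -2.0, 0.5], T_hat=[1.1, -1.9, 0.45])
print(lhs <= rhs)
```

The bound is typically loose, since it replaces every per-component error by the worst one.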
C.5 Estimation of $\sigma^2$, $M_2$, and $M_3$

The covariance matrix can be written as $\mathcal{M}_2 - \mu\mu^\top$, and the empirical covariance matrix can be written as $\widehat{\mathcal{M}}_2 - \hat\mu\hat\mu^\top$. Recall that the estimate of $\sigma^2$, denoted by $\hat\sigma^2$, is given by the $k$-th largest eigenvalue of the empirical covariance matrix $\widehat{\mathcal{M}}_2 - \hat\mu\hat\mu^\top$; and that the estimate of $M_2$, denoted by $\hat M_2$, is the best rank-$k$ approximation to $\widehat{\mathcal{M}}_2 - \hat\sigma^2 I$. Of course, the singular values of a positive semi-definite matrix are the same as its eigenvalues; in particular, $\hat\sigma^2 = \varsigma_k[\widehat{\mathcal{M}}_2 - \hat\mu\hat\mu^\top]$.

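Lemma 7 (Accuracy of $\hat\sigma^2$ and $\hat M_2$).

1. $|\hat\sigma^2 - \sigma^2| \le \|\widehat{\mathcal{M}}_2 - \mathcal{M}_2\|_2 + 2\|\mu\|_2\|\hat\mu - \mu\|_2 + \|\hat\mu - \mu\|_2^2$.

2. $\|\hat M_2 - M_2\|_2 \le 4\|\widehat{\mathcal{M}}_2 - \mathcal{M}_2\|_2 + 4\|\mu\|_2\|\hat\mu - \mu\|_2 + 2\|\hat\mu - \mu\|_2^2$.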

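Proof. Using Weyl's inequality (Stewart and Sun, 1990, Theorem 4.11, p. 204), we obtain

$$|\varsigma_k[\widehat{\mathcal{M}}_2 - \hat\mu\hat\mu^\top] - \varsigma_k[\mathcal{M}_2 - \mu\mu^\top]| \le \|(\widehat{\mathcal{M}}_2 - \hat\mu\hat\mu^\top) - (\mathcal{M}_2 - \mu\mu^\top)\|_2 \le \|\widehat{\mathcal{M}}_2 - \mathcal{M}_2\|_2 + 2\|\mu\|_2\|\hat\mu - \mu\|_2 + \|\hat\mu - \mu\|_2^2.$$

The first claim then follows by observing that $\varsigma_k[\mathcal{M}_2 - \mu\mu^\top] = \sigma^2$ as per Theorem 1.

For the second claim, observe that $\varsigma_{k+1}(\mathcal{M}_2 - \sigma^2 I) = 0$ as $\mathcal{M}_2 - \sigma^2 I$ has rank $k$. Therefore

$$\varsigma_{k+1}(\widehat{\mathcal{M}}_2 - \hat\sigma^2 I) = |\varsigma_{k+1}(\widehat{\mathcal{M}}_2 - \hat\sigma^2 I) - \varsigma_{k+1}(\mathcal{M}_2 - \sigma^2 I)| \le \|(\widehat{\mathcal{M}}_2 - \hat\sigma^2 I) - (\mathcal{M}_2 - \sigma^2 I)\|_2,$$

again using Weyl's inequality. Since $\hat M_2$ is the best rank-$k$ approximation to $\widehat{\mathcal{M}}_2 - \hat\sigma^2 I$, it follows that

$$\|\hat M_2 - (\widehat{\mathcal{M}}_2 - \hat\sigma^2 I)\|_2 \le \varsigma_{k+1}(\widehat{\mathcal{M}}_2 - \hat\sigma^2 I) \le \|(\widehat{\mathcal{M}}_2 - \hat\sigma^2 I) - (\mathcal{M}_2 - \sigma^2 I)\|_2.$$

Therefore

$$\begin{aligned}
\|\hat M_2 - M_2\|_2 &\le \|(\widehat{\mathcal{M}}_2 - \hat\sigma^2 I) - (\mathcal{M}_2 - \sigma^2 I)\|_2 + \|\hat M_2 - (\widehat{\mathcal{M}}_2 - \hat\sigma^2 I)\|_2 \\
&\le 2\|(\widehat{\mathcal{M}}_2 - \hat\sigma^2 I) - (\mathcal{M}_2 - \sigma^2 I)\|_2 \\
&\le 2\|\widehat{\mathcal{M}}_2 - \mathcal{M}_2\|_2 + 2|\hat\sigma^2 - \sigma^2| \\
&\le 4\|\widehat{\mathcal{M}}_2 - \mathcal{M}_2\|_2 + 4\|\mu\|_2\|\hat\mu - \mu\|_2 + 2\|\hat\mu - \mu\|_2^2. \qquad \square
\end{aligned}$$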

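Recall that the estimate of $M_3$, denoted by $\hat M_3$, is given by $\widehat{\mathcal{M}}_3 - \hat\sigma^2 \sum_{i=1}^d (\hat\mu \otimes e_i \otimes e_i + e_i \otimes \hat\mu \otimes e_i + e_i \otimes e_i \otimes \hat\mu)$.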



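Lemma 8 (Accuracy of $\hat M_3$). For any matrix $R \in \mathbb{R}^{d \times r}$,

$$\begin{aligned}
\|\hat M_3[R,R,R] - M_3[R,R,R]\|_2 \le{} & \|\widehat{\mathcal{M}}_3[R,R,R] - \mathcal{M}_3[R,R,R]\|_2 \\
& + 3\|R\|_2^2 \left( \|R^\top(\hat\mu - \mu)\|_2 + \|R^\top \mu\|_2 \right) \left( \|\widehat{\mathcal{M}}_2 - \mathcal{M}_2\|_2 + 2\|\mu\|_2\|\hat\mu - \mu\|_2 + \|\hat\mu - \mu\|_2^2 \right) \\
& + \sigma^2 \|R\|_2^2 \|R^\top(\hat\mu - \mu)\|_2.
\end{aligned}$$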

Proof. Let $\hat G := \sum_{i=1}^d (\hat\mu \otimes e_i \otimes e_i + e_i \otimes \hat\mu \otimes e_i + e_i \otimes e_i \otimes \hat\mu)$ and $G := \sum_{i=1}^d (\mu \otimes e_i \otimes e_i + e_i \otimes \mu \otimes e_i + e_i \otimes e_i \otimes \mu)$. Then

$$\begin{aligned}
\|(\hat M_3 - M_3)[R,R,R]\|_2
&= \|(\widehat{\mathcal{M}}_3 - \mathcal{M}_3)[R,R,R] - (\hat\sigma^2 - \sigma^2)(\hat G - G)[R,R,R] - (\hat\sigma^2 - \sigma^2)G[R,R,R] - \sigma^2(\hat G - G)[R,R,R]\|_2 \\
&\le \|(\widehat{\mathcal{M}}_3 - \mathcal{M}_3)[R,R,R]\|_2 + |\hat\sigma^2 - \sigma^2| \|(\hat G - G)[R,R,R]\|_2 + |\hat\sigma^2 - \sigma^2| \|G[R,R,R]\|_2 + \sigma^2 \|(\hat G - G)[R,R,R]\|_2.
\end{aligned}$$

Observe that by the triangle inequality, $\|G[R,R,R]\|_2 \le 3\|R\|_2^2\|R^\top \mu\|_2$ and $\|(\hat G - G)[R,R,R]\|_2 \le 3\|R\|_2^2\|R^\top(\hat\mu - \mu)\|_2$. Furthermore, by Lemma 7, we have $|\hat\sigma^2 - \sigma^2| \le \|\widehat{\mathcal{M}}_2 - \mathcal{M}_2\|_2 + 2\|\mu\|_2\|\hat\mu - \mu\|_2 + \|\hat\mu - \mu\|_2^2$. Therefore the claim follows. □

C.6 Properties of projection and whitening operators

Recall that $\hat U \in \mathbb{R}^{d \times k}$ is the matrix of left orthonormal singular vectors of $\hat M_2$, and let $\hat S \in \mathbb{R}^{k \times k}$ be the diagonal matrix of corresponding singular values. Analogously define $U$ and $S$ relative to $M_2$. Define $\mathcal{E}_{M_2} := \|\hat M_2 - M_2\|_2 / \varsigma_k[M_2]$.

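Lemma 9 (Properties of projection operators). Assume $\mathcal{E}_{M_2} \le 1/3$. Then

1. $(1 + \mathcal{E}_{M_2}) S \succeq \hat U^\top \hat M_2 \hat U = \hat S \succeq (1 - \mathcal{E}_{M_2}) S \succ 0$.

2. $\varsigma_k[\hat U^\top U] \ge \sqrt{1 - (9/4)\mathcal{E}_{M_2}^2} > 0$.

3. $\varsigma_k[\hat U^\top M_2 \hat U] \ge (1 - (9/4)\mathcal{E}_{M_2}^2)\,\varsigma_k[M_2] > 0$.

4. $\|(I - \hat U \hat U^\top) U U^\top\|_2 \le (3/2)\mathcal{E}_{M_2}$.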

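Proof. By the assumptions that $\mathcal{E}_{M_2} \le 1/3$ and $M_2$ is symmetric positive definite, we have

$$|\varsigma_t[\hat M_2] - \varsigma_t[M_2]| \le \|\hat M_2 - M_2\|_2 \le \mathcal{E}_{M_2}\,\varsigma_k[M_2], \quad \forall t \in [k]$$

by Weyl's inequality. Therefore $\hat M_2$ is symmetric positive definite, and

$$(1 + \mathcal{E}_{M_2}) S \succeq \hat U^\top \hat M_2 \hat U = \hat S \succeq (1 - \mathcal{E}_{M_2}) S,$$

which proves the first claim.

Now let $\hat U_\perp \in \mathbb{R}^{d \times (d-k)}$ be a matrix with orthonormal columns spanning the orthogonal complement of the range of $\hat M_2$. By Wedin's theorem (Stewart and Sun, 1990, Theorem 4.4, p. 262) and the assumption that $\mathcal{E}_{M_2} \le 1/3$,

$$\|\hat U_\perp^\top U\|_2 \le \frac{\|\hat M_2 - M_2\|_2}{\varsigma_k[\hat M_2]} \le \frac{\mathcal{E}_{M_2}}{1 - \mathcal{E}_{M_2}} \le \frac{3}{2}\mathcal{E}_{M_2} \le \frac{1}{2}.$$

Therefore $\hat U^\top U$ is non-singular, and for any unit vector $v \in \mathbb{R}^k$, $\|\hat U^\top U v\|_2^2 = 1 - \|\hat U_\perp^\top U v\|_2^2 \ge 1 - (9/4)\mathcal{E}_{M_2}^2$, which in turn implies the second claim. The second claim then implies that $\varsigma_k[\hat U^\top M_2 \hat U] \ge \varsigma_k[\hat U^\top U]^2\,\varsigma_k[M_2] \ge (1 - (9/4)\mathcal{E}_{M_2}^2)\,\varsigma_k[M_2]$, which gives the third claim. For the final claim, observe that $\|(I - \hat U \hat U^\top) U U^\top\|_2 = \|\hat U_\perp \hat U_\perp^\top U U^\top\|_2 = \|\hat U_\perp^\top U\|_2 \le (3/2)\mathcal{E}_{M_2}$, using the fact that $\hat U_\perp$ and $U$ have orthonormal columns, and the above displayed inequality. □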

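Recall that $\hat W = \hat U (\hat U^\top \hat M_2 \hat U)^{\dagger 1/2}$.

Lemma 10 (Properties of whitening operators). Define $W := \hat W (\hat W^\top M_2 \hat W)^{\dagger 1/2}$. Assume $\mathcal{E}_{M_2} \le 1/3$. Then

1. $\hat W^\top M_2 \hat W$ is symmetric positive definite, $W^\top M_2 W = I$, and $W^\top A \operatorname{diag}(w)^{1/2}$ is orthogonal.

2. $\|\hat W\|_2 \le \dfrac{1}{\sqrt{(1 - \mathcal{E}_{M_2})\,\varsigma_k[M_2]}}$.

3. $\|(\hat W^\top M_2 \hat W)^{1/2} - I\|_2 \le (3/2)\mathcal{E}_{M_2}$,
$\|(\hat W^\top M_2 \hat W)^{-1/2} - I\|_2 \le (3/2)\mathcal{E}_{M_2}$,
$\|\hat W^\top A \operatorname{diag}(w)^{1/2}\|_2 \le \sqrt{1 + (3/2)\mathcal{E}_{M_2}}$,
$\|(\hat W - W)^\top A \operatorname{diag}(w)^{1/2}\|_2 \le (3/2)\sqrt{1 + (3/2)\mathcal{E}_{M_2}}\;\mathcal{E}_{M_2}$.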

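Proof. By Lemma 9 (first and third claims), the matrices $\hat U^\top \hat M_2 \hat U$ and $\hat U^\top M_2 \hat U$ are symmetric positive definite. Therefore

$$\hat W^\top M_2 \hat W = (\hat U^\top \hat M_2 \hat U)^{-1/2} (\hat U^\top M_2 \hat U) (\hat U^\top \hat M_2 \hat U)^{-1/2} \succ 0,$$
$$W = \hat W (\hat W^\top M_2 \hat W)^{-1/2},$$
$$W^\top M_2 W = (\hat W^\top M_2 \hat W)^{-1/2} (\hat W^\top M_2 \hat W) (\hat W^\top M_2 \hat W)^{-1/2} = I.$$

Since $M_2 = A \operatorname{diag}(w) A^\top$, it follows from the third equation above that $W^\top A \operatorname{diag}(w)^{1/2}$ is orthogonal. Thus the first claim is established.

For the second claim, note that

$$\|\hat W\|_2 \le \|(\hat U^\top \hat M_2 \hat U)^{-1/2}\|_2 = \varsigma_k[\hat M_2]^{-1/2} \le \left( (1 - \mathcal{E}_{M_2})\,\varsigma_k[M_2] \right)^{-1/2}$$

where the last inequality follows from Lemma 9 (first claim).

To show the third claim, we first bound $\|\hat W^\top M_2 \hat W - I\|_2$ as

$$\|\hat W^\top M_2 \hat W - I\|_2 = \|\hat W^\top (M_2 - \hat M_2) \hat W\|_2 \le \|\hat W\|_2^2 \|M_2 - \hat M_2\|_2 \le \frac{\mathcal{E}_{M_2}}{1 - \mathcal{E}_{M_2}} \le \frac{3}{2}\mathcal{E}_{M_2} \le \frac{1}{2}$$

where the second inequality follows from the second claim. This implies that every eigenvalue of $\hat W^\top M_2 \hat W$ is contained in the interval of radius $(3/2)\mathcal{E}_{M_2}$ around 1. Because $|(1 + x)^{-1/2} - 1| \le |x|$ for all $|x| \le 1/2$, the same is true of the eigenvalues of $(\hat W^\top M_2 \hat W)^{-1/2}$:

$$\|(\hat W^\top M_2 \hat W)^{-1/2} - I\|_2 \le \frac{3}{2}\mathcal{E}_{M_2}.$$

Furthermore,

$$\|\hat W^\top A \operatorname{diag}(w)^{1/2}\|_2^2 = \|\hat W^\top M_2 \hat W\|_2 = \|I + \hat W^\top M_2 \hat W - I\|_2 \le 1 + \|\hat W^\top M_2 \hat W - I\|_2 \le 1 + \frac{3}{2}\mathcal{E}_{M_2},$$

so

$$\|(\hat W - W)^\top A \operatorname{diag}(w)^{1/2}\|_2 = \|(I - (\hat W^\top M_2 \hat W)^{-1/2}) \hat W^\top A \operatorname{diag}(w)^{1/2}\|_2 \le \|I - (\hat W^\top M_2 \hat W)^{-1/2}\|_2 \|\hat W^\top A \operatorname{diag}(w)^{1/2}\|_2 \le \frac{3}{2}\mathcal{E}_{M_2}\sqrt{1 + \frac{3}{2}\mathcal{E}_{M_2}}.$$

This establishes the third claim. □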
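The proof of Lemma 10 relies on the scalar inequality |(1 + x)^(-1/2) - 1| <= |x| for |x| <= 1/2. The following sketch is a grid check of that inequality, not a proof; the helper name and the grid resolution are arbitrary choices made for illustration.

```python
def inv_sqrt_gap(x):
    """Distance of (1 + x)^(-1/2) from 1."""
    return abs((1.0 + x) ** -0.5 - 1.0)

# Evenly spaced grid on [-1/2, 1/2]; the resolution is arbitrary.
grid = [-0.5 + i / 1000.0 for i in range(1001)]

# Largest violation of gap <= |x| over the grid (non-positive if the bound holds).
worst_excess = max(inv_sqrt_gap(x) - abs(x) for x in grid)
print(worst_excess <= 1e-12)
```

The ratio of the gap to |x| is largest at x = -1/2, which is why the assumption restricts the eigenvalue perturbation to at most 1/2.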
Define $\hat T := \hat M_3[\hat W, \hat W, \hat W]$ and $T := M_3[W, W, W]$, both symmetric tensors in $\mathbb{R}^{k \times k \times k}$. Also, define $\hat T[u] := \hat M_3[\hat W, \hat W, \hat W u]$ and $T[u] := M_3[W, W, W u]$, both symmetric matrices in $\mathbb{R}^{k \times k}$.

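Lemma 11 (Tensor structure). The tensor $T$ can be written as

$$T = \sum_{i=1}^k \frac{1}{\sqrt{w_i}} \left( W^\top A \operatorname{diag}(w)^{1/2} e_i \right) \otimes \left( W^\top A \operatorname{diag}(w)^{1/2} e_i \right) \otimes \left( W^\top A \operatorname{diag}(w)^{1/2} e_i \right),$$

where the vectors $\{W^\top A \operatorname{diag}(w)^{1/2} e_i : i \in [k]\}$ are orthonormal. Furthermore, the eigenvectors of $T[u]$ are $\{W^\top A \operatorname{diag}(w)^{1/2} e_i : i \in [k]\}$ and the corresponding eigenvalues are $\{u^\top W^\top \mu_i : i \in [k]\}$.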

Proof. The structure of T follows from Lemma El and the orthogonality of {VF T ^4diag(w) 1//2 ej : 
i G [A;]} follows from Lemma [TU] (first claim). The eigendecomposition of T[u] is then readily seen 
from the structure of T. □ 
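The eigenstructure claimed in Lemma 11 can be illustrated on a small example. The sketch below assumes k = 2 and takes the orthonormal vectors v_i (standing in for the whitened, rescaled means) to be a planar rotation basis; the weights, the angle and the direction u are illustrative values, not taken from the paper.

```python
import math

def build_tensor(w, vs):
    """T = sum_i w_i^(-1/2) * v_i (x) v_i (x) v_i as a nested k x k x k list."""
    k = len(w)
    return [[[sum(vs[i][a] * vs[i][b] * vs[i][c] / math.sqrt(w[i]) for i in range(k))
              for c in range(k)] for b in range(k)] for a in range(k)]

def contract_last(T, u):
    """The matrix T[u]: contract the third mode of T with the vector u."""
    k = len(u)
    return [[sum(T[a][b][c] * u[c] for c in range(k)) for b in range(k)]
            for a in range(k)]

def eigen_residual(M, v, lam):
    """Largest entry of |M v - lam v|."""
    k = len(v)
    Mv = [sum(M[a][b] * v[b] for b in range(k)) for a in range(k)]
    return max(abs(Mv[a] - lam * v[a]) for a in range(k))

phi = 0.3                                    # arbitrary rotation angle
vs = [[math.cos(phi), math.sin(phi)], [-math.sin(phi), math.cos(phi)]]
w = [0.4, 0.6]                               # mixing weights
u = [0.6, 0.8]                               # direction used in T[u]

Tu = contract_last(build_tensor(w, vs), u)
# Predicted eigenvalues u^T v_i / sqrt(w_i); this equals u^T W^T mu_i
# because W^T mu_i = v_i / sqrt(w_i).
lams = [sum(a * b for a, b in zip(u, v)) / math.sqrt(wi) for v, wi in zip(vs, w)]
residuals = [eigen_residual(Tu, v, lam) for v, lam in zip(vs, lams)]
print(max(residuals))
```

The residuals are zero up to rounding, which matches the statement that each v_i is an eigenvector of T[u].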

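Lemma 12 (Tensor accuracy). Assume $\mathcal{E}_{M_2} \le 1/3$. Then

$$\|\hat T - T\|_2 \le \|\hat M_3[\hat W, \hat W, \hat W] - M_3[\hat W, \hat W, \hat W]\|_2 + \frac{6}{\sqrt{w_{\min}}}\mathcal{E}_{M_2}.$$

Proof. By Lemma 11, $T = \sum_{i=1}^k w_i^{-1/2}\, v_i \otimes v_i \otimes v_i$ for some orthonormal vectors $\{v_i : i \in [k]\}$, so $\|T\|_2 \le 1/\sqrt{w_{\min}}$. By Lemma 10 (first and third claims), $\hat W = W (\hat W^\top M_2 \hat W)^{1/2}$ and $\|(\hat W^\top M_2 \hat W)^{1/2} - I\|_2 \le (3/2)\mathcal{E}_{M_2}$. Therefore

$$\begin{aligned}
\|M_3[\hat W, \hat W, \hat W] - M_3[W, W, W]\|_2
&\le \|M_3[\hat W - W, \hat W, \hat W]\|_2 + \|M_3[W, \hat W - W, \hat W]\|_2 + \|M_3[W, W, \hat W - W]\|_2 \\
&\le \|M_3[W, W, W]\|_2 \Big( \|(\hat W^\top M_2 \hat W)^{1/2} - I\|_2 \|(\hat W^\top M_2 \hat W)^{1/2}\|_2^2 \\
&\qquad + \|(\hat W^\top M_2 \hat W)^{1/2} - I\|_2 \|(\hat W^\top M_2 \hat W)^{1/2}\|_2 + \|(\hat W^\top M_2 \hat W)^{1/2} - I\|_2 \Big) \\
&\le \|M_3[W, W, W]\|_2 \left( (1 + (3/2)\mathcal{E}_{M_2})^2 (3/2)\mathcal{E}_{M_2} + (1 + (3/2)\mathcal{E}_{M_2})(3/2)\mathcal{E}_{M_2} + (3/2)\mathcal{E}_{M_2} \right) \\
&\le 6\|M_3[W, W, W]\|_2\,\mathcal{E}_{M_2} \le \frac{6}{\sqrt{w_{\min}}}\mathcal{E}_{M_2}.
\end{aligned}$$

Thus we can bound $\|\hat T - T\|_2$ using the triangle inequality and the above bound:

$$\begin{aligned}
\|\hat T - T\|_2 &= \|\hat M_3[\hat W, \hat W, \hat W] - M_3[W, W, W]\|_2 \\
&\le \|\hat M_3[\hat W, \hat W, \hat W] - M_3[\hat W, \hat W, \hat W]\|_2 + \|M_3[\hat W, \hat W, \hat W] - M_3[W, W, W]\|_2 \\
&\le \|\hat M_3[\hat W, \hat W, \hat W] - M_3[\hat W, \hat W, \hat W]\|_2 + \frac{6}{\sqrt{w_{\min}}}\mathcal{E}_{M_2}. \qquad \square
\end{aligned}$$

C.7 Eigendecomposition analysis

Define

$$\gamma := \frac{1}{2\sqrt{w_{\max}}\sqrt{ek}\binom{k+1}{2}}$$

where $w_{\max} := \max_{i \in [k]} w_i$.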

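Lemma 13 (Random separation). Let $\theta \in \mathbb{R}^k$ be a random vector distributed uniformly over the unit sphere in $\mathbb{R}^k$. Let $Q := \{e_i - e_j : \{i,j\} \in \binom{[k]}{2}\} \cup \{e_i : i \in [k]\}$. Then

$$\Pr\left[ \min_{q \in Q} |\theta^\top W^\top A q| > \gamma \right] \ge \frac{1}{2}$$

where the probability is taken with respect to the distribution of $\theta$.

Proof. By Lemma 17 (applied with $\delta = 1/2$, $p = k$, and $|Q| = \binom{k+1}{2}$), with probability at least $1/2$,

$$\min_{q \in Q} |\theta^\top W^\top A q| > \frac{\min_{q \in Q} \|W^\top A q\|_2}{2\sqrt{ek}\binom{k+1}{2}}.$$

Now fix any $i \ne j$. By Lemma 10 (first claim), $W^\top A \operatorname{diag}(w)^{1/2}$ is orthogonal, so $\|W^\top A (e_i - e_j)\|_2 = \|\operatorname{diag}(w)^{-1/2}(e_i - e_j)\|_2 = \|e_i/\sqrt{w_i} - e_j/\sqrt{w_j}\|_2 = \sqrt{1/w_i + 1/w_j}$. Similarly, for any $i \in [k]$, $\|W^\top A e_i\|_2 = \sqrt{1/w_i}$. Therefore $\min_{q \in Q} \|W^\top A q\|_2 = \min_{i \in [k]} \sqrt{1/w_i} = 1/\sqrt{w_{\max}}$. □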
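The random separation guarantee can be explored by simulation. The sketch below assumes, purely for illustration, that W^T A diag(w)^{1/2} is the identity (any orthogonal matrix would behave the same by rotation invariance of the uniform distribution on the sphere), so that W^T A e_i = e_i / sqrt(w_i). It uses gamma = 1 / (2 sqrt(e k w_max) C(k+1, 2)), and the function names are hypothetical.

```python
import math
import random

def gamma(k, w_max):
    """Separation threshold 1 / (2 * sqrt(e * k * w_max) * C(k+1, 2))."""
    return 1.0 / (2.0 * math.sqrt(math.e * k * w_max) * math.comb(k + 1, 2))

def separation_frequency(w, trials, seed=0):
    """Fraction of random unit vectors theta with min_q |theta^T W^T A q| > gamma.

    Assumes W^T A diag(w)^(1/2) = I, so W^T A e_i = e_i / sqrt(w_i).
    """
    rng = random.Random(seed)
    k = len(w)
    g = gamma(k, max(w))
    cols = [[(1.0 / math.sqrt(w[i]) if a == i else 0.0) for a in range(k)]
            for i in range(k)]
    diffs = [[ci - cj for ci, cj in zip(cols[i], cols[j])]
             for i in range(k) for j in range(i + 1, k)]
    images = cols + diffs                      # W^T A q for every q in Q
    hits = 0
    for _ in range(trials):
        raw = [rng.gauss(0.0, 1.0) for _ in range(k)]
        norm = math.sqrt(sum(x * x for x in raw))
        theta = [x / norm for x in raw]        # uniform on the unit sphere
        smallest = min(abs(sum(t * x for t, x in zip(theta, img))) for img in images)
        if smallest > g:
            hits += 1
    return hits / trials

freq = separation_frequency([0.2, 0.3, 0.5], trials=2000)
print(freq >= 0.5)
```

In small examples like this one the observed frequency tends to be well above 1/2, reflecting that the lemma is a worst-case statement.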

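Let $\mathcal{E}_T := \|\hat T - T\|_2/\gamma$. Let $\theta_1, \theta_2, \ldots, \theta_t$ be the random unit vectors in $\mathbb{R}^k$ drawn by the algorithm. Define $\hat T[\theta_{t'}] := \hat M_3[\hat W, \hat W, \hat W \theta_{t'}]$ and $T[\theta_{t'}] := M_3[W, W, W \theta_{t'}]$. Also, let $\hat\Delta(t') := \min\left( \{|\hat\lambda_i - \hat\lambda_j| : i \ne j\} \cup \{|\hat\lambda_i| : i \in [k]\} \right)$ for the eigenvalues $\{\hat\lambda_i : i \in [k]\}$ of $\hat T[\theta_{t'}]$; and let $\Delta(t') := \min\left( \{|\lambda_i - \lambda_j| : i \ne j\} \cup \{|\lambda_i| : i \in [k]\} \right)$ for the eigenvalues $\{\lambda_i : i \in [k]\}$ of $T[\theta_{t'}]$.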

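Lemma 14 (Eigenvalue gap). Pick any $\delta \in (0, 1)$. If $t \ge \log_2(1/\delta)$, then with probability at least $1 - \delta$, the trial $\hat\tau := \arg\max_{t' \in [t]} \hat\Delta(t')$ satisfies

$$\hat\Delta(\hat\tau) \ge \gamma - 2\mathcal{E}_T\gamma.$$

Proof. For each $t' \in [t]$, the eigenvalues $\hat\lambda_1, \hat\lambda_2, \ldots, \hat\lambda_k$ of $\hat T[\theta_{t'}]$ (arranged in non-increasing order) satisfy

$$|\hat\lambda_i - \hat\lambda_j| \ge |\lambda_i - \lambda_j| - 2\|\hat T[\theta_{t'}] - T[\theta_{t'}]\|_2 \ge |\lambda_i - \lambda_j| - 2\mathcal{E}_T\gamma, \quad i \ne j \qquad (2)$$

where the second inequality follows from Weyl's inequality. Similarly,

$$|\hat\lambda_i| \ge |\lambda_i| - |\hat\lambda_i - \lambda_i| \ge |\lambda_i| - \mathcal{E}_T\gamma. \qquad (3)$$

By Lemma 11, the eigenvalues of $T[v]$ are $v^\top W^\top A e_i$ for $i \in [k]$. Thus, by Lemma 13, the probability that some $\tau \in [t]$ has $\Delta(\tau) > \gamma$ is at least $1 - \delta$. In this event, (2) implies that trial $\tau$ has $\hat\Delta(\tau) \ge \Delta(\tau) - 2\mathcal{E}_T\gamma$, and hence $\hat\Delta(\hat\tau) = \max_{t' \in [t]} \hat\Delta(t') \ge \gamma - 2\mathcal{E}_T\gamma$. □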
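The requirement t >= log_2(1/delta) in Lemma 14 comes from amplifying the success probability 1/2 of Lemma 13 over independent trials: all t trials fail with probability at most (1/2)^t. A minimal sketch of that arithmetic follows; the function names are hypothetical.

```python
import math

def trials_needed(delta):
    """Smallest integer t with t >= log2(1/delta)."""
    return math.ceil(math.log2(1.0 / delta))

def all_trials_fail_probability(t):
    """Upper bound on the chance that none of t independent trials is separated."""
    return 0.5 ** t

for delta in (0.5, 0.1, 0.01, 1e-6):
    t = trials_needed(delta)
    print(delta, t, all_trials_fail_probability(t) <= delta)
```

The number of trials grows only logarithmically in 1/delta, so high confidence costs few extra eigendecompositions.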

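We now just consider the trial $\hat\tau$ retained by the algorithm. Let $\{(v_i, \lambda_i) : i \in [k]\}$ be the eigenvector/eigenvalue pairs of $T[\theta_{\hat\tau}]$, and let $\{(\hat v_i, \hat\lambda_i) : i \in [k]\}$ be the eigenvector/eigenvalue pairs of $\hat T[\theta_{\hat\tau}]$.

Lemma 15 (Accuracy of eigendecomposition). Assume the $1 - \delta$ probability event in Lemma 14 holds, and also assume that $\mathcal{E}_T \le 1/4$. Then there exists a permutation $\pi$ on $[k]$ and signs $s_1, s_2, \ldots, s_k \in \{\pm 1\}$ such that, for all $i \in [k]$,

$$\|v_i - s_i \hat v_{\pi(i)}\|_2 \le 4\sqrt{2}\,\mathcal{E}_T,$$
$$|\lambda_i - \hat\lambda_{\pi(i)}| \le \mathcal{E}_T\gamma.$$

Proof. To simplify notation, assume the eigenvalues of $\hat T[\theta_{\hat\tau}]$ and $T[\theta_{\hat\tau}]$ are already sorted in non-increasing order. Observe that for all $i \ne j$,

$$\begin{aligned}
|\hat\lambda_i - \lambda_j| &= |\hat\lambda_i - \hat\lambda_j + \hat\lambda_j - \lambda_j| \\
&\ge |\hat\lambda_i - \hat\lambda_j| - |\hat\lambda_j - \lambda_j| \\
&\ge (\gamma - 2\mathcal{E}_T\gamma) - \mathcal{E}_T\gamma \\
&\ge \gamma/4
\end{aligned}$$

where the second-to-last inequality follows by the assumption $\hat\Delta(\hat\tau) \ge \gamma - 2\mathcal{E}_T\gamma$ and by Weyl's inequality. Therefore, the interval of radius $\gamma/4$ surrounding each eigenvalue $\hat\lambda_i$ of $\hat T[\theta_{\hat\tau}]$ contains only one eigenvalue $\lambda_i$ of $T[\theta_{\hat\tau}]$. By the Davis-Kahan $\sin(\Theta)$ theorem (Stewart and Sun, 1990, Theorem 3.4, p. 250), we have that

$$\sqrt{1 - (v_i^\top \hat v_i)^2} \le 4\mathcal{E}_T.$$

Therefore, for $s_i := \operatorname{sign}(v_i^\top \hat v_i)$,

$$\|v_i - s_i \hat v_i\|_2^2 = 2(1 - s_i v_i^\top \hat v_i) = 2(1 - |v_i^\top \hat v_i|) \le 2\left(1 - \sqrt{1 - (4\mathcal{E}_T)^2}\right) \le 32\mathcal{E}_T^2.$$

The bound $|\lambda_i - \hat\lambda_i| \le \mathcal{E}_T\gamma$ follows simply from Weyl's inequality. □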
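The last step of the eigenvector bound in Lemma 15 uses the scalar inequality 2(1 - sqrt(1 - (4E)^2)) <= 32 E^2 for 0 <= E <= 1/4, which gives the 4 sqrt(2) E bound after taking square roots. The sketch below checks it on a grid; this is an illustration rather than a proof, and the names are hypothetical.

```python
import math

def squared_eigvec_error_bound(e_t):
    """2 * (1 - sqrt(1 - (4 e_t)^2)), the bound obtained from the sine of the angle."""
    x = 4.0 * e_t
    return 2.0 * (1.0 - math.sqrt(max(0.0, 1.0 - x * x)))

# Grid over the admissible range 0 <= E_T <= 1/4.
grid = [0.25 * i / 1000.0 for i in range(1001)]
worst_excess = max(squared_eigvec_error_bound(e) - 32.0 * e * e for e in grid)
print(worst_excess <= 1e-12)
```

Equality holds at E_T = 1/4, where both sides equal 2, so the constant 32 cannot be improved over the whole range.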

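C.8 Overall error analysis

Define

$$\kappa[M_2] := \varsigma_1[M_2]/\varsigma_k[M_2],$$

$$\epsilon_0 := \left( 5.5\,\mathcal{E}_{M_2} + 7\,\mathcal{E}_T \right)/\sqrt{w_{\min}},$$

$$\begin{aligned}
\epsilon_1 &:= \left( \frac{1.25\,\|M_2\|_2^{1/2}\,\epsilon_0\sqrt{w_{\min}}}{\varsigma_k[M_2]^{1/2}} + 2\,\mathcal{E}_{M_2} + \gamma\sqrt{w_{\max}}\,\mathcal{E}_T \right) \Big/ \left( \gamma\sqrt{w_{\min}} \right) \\
&= \left( \left( 6.875\,\kappa[M_2]^{1/2} + 2 \right)\mathcal{E}_{M_2} + \left( 8.75\,\kappa[M_2]^{1/2} + \gamma\sqrt{w_{\max}} \right)\mathcal{E}_T \right) \Big/ \left( \gamma\sqrt{w_{\min}} \right).
\end{aligned}$$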

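Lemma 16 (Error bound). Assume the $1 - \delta$ probability event of Lemma 14 holds, and also assume that $\mathcal{E}_{M_2} \le 1/3$, $\mathcal{E}_T \le 1/4$, and $\epsilon_1 \le 1/3$. Then there exists a permutation $\pi$ on $[k]$ such that

$$\|\hat\mu_{\pi(i)} - \mu_i\|_2 \le 3\|\mu_i\|_2\,\epsilon_1 + 2\|M_2\|_2^{1/2}\,\epsilon_0, \quad i \in [k].$$

Proof. To simplify notation, we assume throughout that the permutation $\pi$ from Lemma 15 is the identity permutation. Let $V := [v_1 | v_2 | \cdots | v_k]$. We first bound $\hat B s_i \hat v_i - \sqrt{w_i}\mu_i$. This quantity can be split into two parts: the part in the range of $\hat U$, and the rest. The part in the range of $\hat U$ is bounded as

$$\begin{aligned}
\|\hat B s_i \hat v_i - \hat U \hat U^\top \sqrt{w_i}\mu_i\|_2
&= \|(\hat U^\top \hat M_2 \hat U)^{1/2} s_i \hat v_i - \hat U^\top A \operatorname{diag}(w)^{1/2} e_i\|_2 \\
&= \|(\hat U^\top \hat M_2 \hat U)^{1/2}(s_i \hat v_i - v_i) + ((\hat U^\top \hat M_2 \hat U)^{1/2} - \hat U^\top A \operatorname{diag}(w)^{1/2} V^\top) v_i\|_2 \\
&\le \|(\hat U^\top \hat M_2 \hat U)^{1/2}(s_i \hat v_i - v_i)\|_2 + \|(\hat U^\top \hat M_2 \hat U)^{1/2} - \hat U^\top A \operatorname{diag}(w)^{1/2} V^\top\|_2 \\
&\le \|\hat M_2\|_2^{1/2}\|s_i \hat v_i - v_i\|_2 + \|(\hat U^\top \hat M_2 \hat U)^{1/2} - \hat U^\top A \operatorname{diag}(w)^{1/2} V^\top\|_2 \\
&\le \left( (1 + \mathcal{E}_{M_2})\|M_2\|_2 \right)^{1/2} 4\sqrt{2}\,\mathcal{E}_T + \|(\hat U^\top \hat M_2 \hat U)^{1/2} - \hat U^\top A \operatorname{diag}(w)^{1/2} V^\top\|_2
\end{aligned}$$

where the third inequality follows from Lemma 9 and Lemma 15. To bound the second term in the last step, recall that $W = \hat W (\hat W^\top M_2 \hat W)^{-1/2}$ (using Lemma 10 to guarantee the positive definiteness of $\hat W^\top M_2 \hat W$), so we may write $\hat U^\top A \operatorname{diag}(w)^{1/2} V^\top$ as

$$\begin{aligned}
\hat U^\top A \operatorname{diag}(w)^{1/2} V^\top &= \hat U^\top A \operatorname{diag}(w)^{1/2} (W^\top A \operatorname{diag}(w)^{1/2})^\top \\
&= \hat U^\top M_2 \hat W (\hat W^\top M_2 \hat W)^{-1/2} \\
&= \left( (\hat U^\top \hat M_2 \hat U)^{1/2} + \hat U^\top (M_2 - \hat M_2) \hat U (\hat U^\top \hat M_2 \hat U)^{-1/2} \right) (\hat W^\top M_2 \hat W)^{-1/2}.
\end{aligned}$$

Therefore

$$\begin{aligned}
\|(\hat U^\top \hat M_2 \hat U)^{1/2} - \hat U^\top A \operatorname{diag}(w)^{1/2} V^\top\|_2
&\le \|(\hat U^\top \hat M_2 \hat U)^{1/2}\|_2 \|I - (\hat W^\top M_2 \hat W)^{-1/2}\|_2 + \|\hat U^\top (M_2 - \hat M_2) \hat U\|_2 \|(\hat U^\top \hat M_2 \hat U)^{-1/2}\|_2 \|(\hat W^\top M_2 \hat W)^{-1/2}\|_2 \\
&\le \|\hat M_2\|_2^{1/2} \|I - (\hat W^\top M_2 \hat W)^{-1/2}\|_2 + \frac{1}{\varsigma_k[\hat U^\top \hat M_2 \hat U]^{1/2}} \|\hat M_2 - M_2\|_2 \left( 1 + \|I - (\hat W^\top M_2 \hat W)^{-1/2}\|_2 \right) \\
&\le (1 + \mathcal{E}_{M_2})^{1/2} \|M_2\|_2^{1/2} (3/2)\mathcal{E}_{M_2} + \frac{1 + (3/2)\mathcal{E}_{M_2}}{(1 - \mathcal{E}_{M_2})^{1/2}}\,\varsigma_k[M_2]^{1/2}\,\mathcal{E}_{M_2}
\end{aligned}$$

where the last inequality uses Lemma 9 and Lemma 10. Thus

$$\|\hat B s_i \hat v_i - \hat U \hat U^\top \sqrt{w_i}\mu_i\|_2 \le (1 + \mathcal{E}_{M_2})^{1/2} \|M_2\|_2^{1/2} 4\sqrt{2}\,\mathcal{E}_T + (1 + \mathcal{E}_{M_2})^{1/2} \|M_2\|_2^{1/2} (3/2)\mathcal{E}_{M_2} + \frac{1 + (3/2)\mathcal{E}_{M_2}}{(1 - \mathcal{E}_{M_2})^{1/2}}\,\varsigma_k[M_2]^{1/2}\,\mathcal{E}_{M_2}.$$

Now consider the part not in the range of $\hat U$. This is simply bounded as

$$\|(I - \hat U \hat U^\top) \sqrt{w_i}\, U U^\top \mu_i\|_2 \le \|(I - \hat U \hat U^\top) U U^\top\|_2 \sqrt{w_i}\,\|\mu_i\|_2 \le (3/2)\mathcal{E}_{M_2} \sqrt{w_i}\,\|\mu_i\|_2$$

using Lemma 9. Therefore, overall, we have

$$\begin{aligned}
\|\hat B \hat v_i - s_i \sqrt{w_i}\mu_i\|_2 &= \|\hat B s_i \hat v_i - \sqrt{w_i}\mu_i\|_2 \\
&\le (1 + \mathcal{E}_{M_2})^{1/2} \|M_2\|_2^{1/2} 4\sqrt{2}\,\mathcal{E}_T + (1 + \mathcal{E}_{M_2})^{1/2} \|M_2\|_2^{1/2} (3/2)\mathcal{E}_{M_2} \\
&\quad + \frac{1 + (3/2)\mathcal{E}_{M_2}}{(1 - \mathcal{E}_{M_2})^{1/2}}\,\varsigma_k[M_2]^{1/2}\,\mathcal{E}_{M_2} + (3/2)\mathcal{E}_{M_2} \sqrt{w_i}\,\|\mu_i\|_2 \\
&\le \|M_2\|_2^{1/2}\,\epsilon_0 \sqrt{w_{\min}}.
\end{aligned}$$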

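Since the actual estimate of $\mu_i$ is $(\hat\lambda_i/\theta^\top \hat v_i)\hat B \hat v_i$, we need to show that $\theta^\top \hat v_i$ is approximately $s_i \sqrt{w_i}\hat\lambda_i$. Indeed,

$$\begin{aligned}
|\theta^\top \hat v_i - s_i \sqrt{w_i}\hat\lambda_i| &= |\theta^\top \hat W^\top (\hat B \hat v_i - s_i \sqrt{w_i}\mu_i) + s_i \sqrt{w_i}\,\theta^\top (\hat W - W)^\top \mu_i + s_i \sqrt{w_i}(\lambda_i - \hat\lambda_i)| \\
&\le \|\hat W \theta\|_2 \|\hat B \hat v_i - s_i \sqrt{w_i}\mu_i\|_2 + \|(\hat W - W)^\top A \operatorname{diag}(w)^{1/2} e_i\|_2 + \sqrt{w_i}\,|\lambda_i - \hat\lambda_i| \\
&\le \frac{\|M_2\|_2^{1/2}\,\epsilon_0 \sqrt{w_{\min}}}{(1 - \mathcal{E}_{M_2})^{1/2}\,\varsigma_k[M_2]^{1/2}} + (3/2)\sqrt{1 + (3/2)\mathcal{E}_{M_2}}\;\mathcal{E}_{M_2} + \sqrt{w_i}\,\gamma\,\mathcal{E}_T \le \epsilon_1 \gamma \sqrt{w_{\min}}
\end{aligned}$$

where the last inequality uses Lemma 10 and Lemma 15. Therefore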

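$$\begin{aligned}
\sqrt{w_i}\,\|(\hat\lambda_i/\theta^\top \hat v_i)\hat B \hat v_i - \mu_i\|_2
&= \|(\hat\lambda_i/\theta^\top \hat v_i)\,s_i \sqrt{w_i}\,\hat B \hat v_i - s_i \sqrt{w_i}\mu_i\|_2 \\
&\le |(\hat\lambda_i/\theta^\top \hat v_i)\,s_i \sqrt{w_i} - 1|\,\|\hat B \hat v_i\|_2 + \|\hat B \hat v_i - s_i \sqrt{w_i}\mu_i\|_2 \\
&\le |(\hat\lambda_i s_i \sqrt{w_i}/\theta^\top \hat v_i) - 1| \left( \sqrt{w_i}\,\|\mu_i\|_2 + \|M_2\|_2^{1/2}\epsilon_0 \sqrt{w_{\min}} \right) + \|M_2\|_2^{1/2}\epsilon_0 \sqrt{w_{\min}} \\
&\le \frac{|\hat\lambda_i s_i \sqrt{w_i} - \theta^\top \hat v_i|}{|\hat\lambda_i| \sqrt{w_i} - |\theta^\top \hat v_i - \hat\lambda_i s_i \sqrt{w_i}|} \left( \sqrt{w_i}\,\|\mu_i\|_2 + \|M_2\|_2^{1/2}\epsilon_0 \sqrt{w_{\min}} \right) + \|M_2\|_2^{1/2}\epsilon_0 \sqrt{w_{\min}} \\
&\le \frac{\epsilon_1 \gamma \sqrt{w_{\min}}}{|\lambda_i| \sqrt{w_i} - |\hat\lambda_i - \lambda_i| \sqrt{w_i} - \epsilon_1 \gamma \sqrt{w_{\min}}} \left( \sqrt{w_i}\,\|\mu_i\|_2 + \|M_2\|_2^{1/2}\epsilon_0 \sqrt{w_{\min}} \right) + \|M_2\|_2^{1/2}\epsilon_0 \sqrt{w_{\min}} \\
&\le \frac{\epsilon_1 \gamma}{\gamma(1 - \mathcal{E}_T) - \epsilon_1 \gamma} \left( \sqrt{w_i}\,\|\mu_i\|_2 + \|M_2\|_2^{1/2}\epsilon_0 \sqrt{w_{\min}} \right) + \|M_2\|_2^{1/2}\epsilon_0 \sqrt{w_{\min}}
\end{aligned}$$

where the fourth inequality uses Lemma 15. We thus conclude that

$$\begin{aligned}
\|\hat\mu_i - \mu_i\|_2 = \|(\hat\lambda_i/\theta^\top \hat v_i)\hat B \hat v_i - \mu_i\|_2
&\le \frac{\epsilon_1 \gamma}{\gamma(1 - \mathcal{E}_T) - \epsilon_1 \gamma} \left( \|\mu_i\|_2 + \|M_2\|_2^{1/2}\epsilon_0 \right) + \|M_2\|_2^{1/2}\epsilon_0 \\
&\le 3\epsilon_1 \left( \|\mu_i\|_2 + \|M_2\|_2^{1/2}\epsilon_0 \right) + \|M_2\|_2^{1/2}\epsilon_0 \\
&\le 3\|\mu_i\|_2\,\epsilon_1 + 2\|M_2\|_2^{1/2}\epsilon_0. \qquad \square
\end{aligned}$$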

We can now prove Theorem 3, restated below with the explicit polynomial sample complexity bound (up to constants).

Theorem 3 restated (Finite sample bound). There exists a constant $C > 0$ such that the following holds. Let $b_{\max} := \max_{i \in [k]} \|\mu_i\|_2$. Pick any $\epsilon, \delta \in (0, 1)$. Suppose the sample size $2n$ satisfies



n > C- 



d + \og{k/S) 



UK, 



+ 



C 



+ c 



(fc + log(fc/£)) £ 

^min 

k + \og(k/8) 

^min 
lfolV2 



i [M 2 ]V 2 ((7 2 + ^) 
7 2w minft[M 2 ]e 

K[M 2 ] l ' 2 a 2 12 



l 2 w m ^ k [M 2 }e 



7 2 ^min?fc[M 2 ]e 



+ 



K[M 2 ] 1 / 2 (7 2 
7 2 ^minft[^2]e 



+ 



k[M2\ 1 I 2 0~ 



+ 



<7 





7 2 V W min£ ft[M 2 ]V2 



max 



{l,a 2 A fc [M 2 ]} 



fcio g (i/<y) 



k[M 2 ]V 2 



maxjl,<7 2 / ?fc [M 2 ]} 



- 2 


k[M 2 ]V2- 


2 


+ 




+ 




. 7 2 ^min e - 





k[M 2 ] 1 / 2 ct 2 

7 2 ^min?fe[M 2 ]e 



where

$$\gamma = \frac{1}{2\sqrt{w_{\max}}\sqrt{ek}\binom{k+1}{2}}$$

(as defined in Lemma 13). Then with probability at least $1 - 3\delta$ over the random sample and the internal randomness of the algorithm, there exists a permutation $\pi$ on $[k]$ such that

$$\|\hat\mu_{\pi(i)} - \mu_i\|_2 \le \left( \|\mu_i\|_2 + \|M_2\|_2^{1/2} \right)\epsilon$$

for all $i \in [k]$.



Proof. Throughout, $C_1, c_1, C_2, c_2, \ldots$ will represent absolute positive constants. First, observe that the sample size bound

$$n \ge C \cdot k \log(1/\delta)$$

and Lemma 4 ensure that $\mathcal{E}_w \le 1$ (where $\mathcal{E}_w$ is defined in Lemma 6). Therefore, from Lemma 5 and Lemma 6 (together with union bounds), with probability at least $1 - \delta$,



\\M 2 - M 2 \\ 2 < cJ a 2 



d+]og{k/5) /klog(l/5) 



Wrn\nri 



n 



U + \og(k/5) 2 d + log(k/5) 
1- cr h crft m ax1 



l d + \og(k/5) \ 



w m - m n 



+ ci(«4 ax + <7' 



fcio g (i/a) 



<Ci 1.7 cj 2 + 6: 



<> 2 

max 



' d + log(fc/J) j2 d + log(A;/J) 
Therefore, by Lemma 7,

max{|<7 2 - a\ \\M 2 - M 2 \\ 2 } < 4Ci \l-7[o* + b 

+ 4C 1 yi 2 ( - 



2 , z.2 

max 



/ d + log(fc/^) + j2 d + log(fc/J) 



d + log(fc/J) / fclogCl/J) ^ 
r PmaxA/ 



The sample size bound 

d + log(fc/<5) 



n > C- 
ensures that 



UK, 



+ 2Cf U 
<C 2 (a 2 + 6 max )^ 



d + log(fc/<5) 



fcio g (i/a) 



n 



'd + log(fc/<5) d + log(fc/(5) 



+ 



+ 



K [M 2 ]V 2 (a 2 + bL 
7 2 ^minft[M 2 ]e 



max i — ncTT'^a r - c i r^u/ 2 £ ^ i/ 3 - 



i-2 ^.2 1 

ft[M 2 j ' fM2 /- Cl ^M 2 ]V2' 

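Now condition on the above event and take $\hat W$ as given. By Lemma 10,

$$\|\hat W\|_2 \le \sqrt{1.5/\varsigma_k[M_2]},$$
$$\max_{i \in [k]} \|\hat W^\top \mu_i\|_2 \le \|\hat W^\top A \operatorname{diag}(w)^{1/2}\|_2/\sqrt{w_{\min}} \le \sqrt{1.5/w_{\min}},$$
$$\max_{i \in [k]} \|\hat W^\top \mathcal{M}_{2,i} \hat W\|_2 \le 1.5\,\sigma^2/\varsigma_k[M_2] + 1.5/w_{\min},$$
$$\max_{i \in [k]} \|\mathcal{M}_{3,i}[\hat W, \hat W, \hat W]\|_2 \le (1.5/w_{\min})^{3/2} + 3\sigma^2 \sqrt{1.5/w_{\min}} \cdot 1.5/\varsigma_k[M_2].$$

Therefore, Lemma 5 and Lemma 6 imply that with probability at least $1 - \delta$,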



\\W T (ji-ri\\ 2 <C 3 



\\M 3 [W,W,W] -M 3 [W,W,W}\\ 2 < C 3 



ft[M 2 ]V2 y w mhi n 

3 



k + \og{k/5) +C3 _ 1 Jklog(l/S) 



ff 



'(fc + log(£;/<5)) 3 



ft[M 2 ]3/2 



+ c 3 

+ c 3 
+ c 3 



ff 



\A^minft 

[M 2 ] 

cr 



'fc + log(fc/o) fc + log(fc/*) 



+ 



Wm\r\Tl 



Wrin\r\Tl 



l k + \og{k/5) 



1 ft[M 2 ]V2 W w , 



,n 



k log {1/5) 



( 1 | a 2 ' 

V™ 3 /. 2 VuWft[M 2 ] / V n 

\ mm v / 
Therefore, by Lemma 8 and Lemma 12,

||T-T|| 2 < \\M 3 [W,W,W]-M 3 [W,W,W]\\ 2 



+ 



4.5 



ft[M 2 ] 
The sample size bound 

k + \og(k/d) 



\\w T (fi-^\\ 2 + ^UJ^ 



\CT -a\ + 



l.5a 2 

ft[M 2 ] 



||W T (A-^)|| 2 . (5) 



n > C- 



+ C 



^min 

Hog(l/<$) 



k[M 2 \ x I 2 a 2 
^M,] 1 / 2 

7 2 i/Wmin£ 



max 



{l,(X 2 /ft[M 2 ]} 



maxjl, cr 2 / ft [M 2 ]} 



ensures 



maxs 



c{l,a /ft[M 2 ] ) <c 2 — 

Furthermore, the sample size bound 

{k + \og{k/8)f \ k[M 2 ] 1 I 2 <j z 



«[M 2 ]V2 



e < 1. 



n > C- 



+ C 
+ C 



fc + log(fc/<?) 

^min 

+ log(fc/<?) 

^min 



7 2 v^WftW /2 £ 

l 2 w min <; k [M 2 ]e 
k\M 2 \ x I 2 o 



+ 



K[M 2 ] 1 / 2 (7 2 
7 2 ^min?fc[M 2 ]e 



+ C- Hog(l/<5) 



7 2 ^ 2 ft[M 2 ]V2 e 
«[M 2 ]V2] 2 p K 



7 2 ^min £ 



+ 



[M 2 ]V2 a 2 



7 2 ^minft[M2]e 



ensures 



7 " C2 /e[M2]V2 e - 



Using the inequalities ([6]) , ((TJ) , and dH with (J5J) gives 



_ ||r-r|| 2 7^^w 

7 - C3 ^[M 2 ]V2 £ - 



(6) 



(7) 



(8) 



Since $t = \log_2(1/\delta)$, the inequality from Lemma 14 holds with probability at least $1 - \delta$ over the internal randomness of the algorithm. Therefore, using (4) and (8) with Lemma 16 gives in this event:



f*ih ^ ^4 • 1 1 A** ] 1 2 • f^M 2 



k[M 2 ]V 2 

~fy/Wmm 

+ C 4 -\\M 2 \\l /2 -(£ M2 +£ T ).—L= 

v v "'mil 



< (||^||2 + ||M 2 ||f 

for all $i \in [k]$, for some permutation $\pi$ on $[k]$. □

D Probability tail inequalities

We recall and derive some probability tail inequalities used in the analysis.

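Lemma 17 (Dasgupta and Gupta, 2003; Anandkumar et al., 2012b). Pick any $\delta \in (0, 1)$, matrix $X \in \mathbb{R}^{p \times p}$, and finite subset $Q \subset \mathbb{R}^p$. If $\theta \in \mathbb{R}^p$ is a random vector distributed uniformly over the unit sphere in $\mathbb{R}^p$, then

$$\Pr\left[ \min_{q \in Q} |\theta^\top X q| > \frac{\min_{q \in Q} \|X q\|_2 \cdot \delta}{\sqrt{ep}\,|Q|} \right] \ge 1 - \delta.$$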



Lemma 18 (Laurent and Massart, 2000). Let $z_1^2, z_2^2, \ldots, z_m^2$ be i.i.d. $\chi^2$ random variables, each with one degree of freedom. Then for any $\delta \in (0, 1)$,

$$\Pr\left[ \sum_{i=1}^m z_i^2 > m + 2\sqrt{m \ln(1/\delta)} + 2\ln(1/\delta) \right] \le \delta.$$
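The chi-squared tail bound of Lemma 18 (threshold m + 2 sqrt(m ln(1/delta)) + 2 ln(1/delta)) can be sanity-checked by simulation. The sketch below draws sums of m squared standard normals and counts how often they exceed the threshold; the sample sizes, the seed and the function names are illustrative assumptions.

```python
import math
import random

def chi2_tail_threshold(m, delta):
    """Threshold m + 2*sqrt(m*ln(1/delta)) + 2*ln(1/delta)."""
    log_term = math.log(1.0 / delta)
    return m + 2.0 * math.sqrt(m * log_term) + 2.0 * log_term

def empirical_exceedance(m, delta, trials, seed=0):
    """Fraction of simulated chi-squared(m) draws that exceed the threshold."""
    rng = random.Random(seed)
    threshold = chi2_tail_threshold(m, delta)
    exceed = 0
    for _ in range(trials):
        total = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m))
        if total > threshold:
            exceed += 1
    return exceed / trials

rate = empirical_exceedance(m=5, delta=0.1, trials=4000)
print(rate <= 0.1)
```

In simulations like this the observed exceedance rate is usually far below delta, which is expected for a bound that holds uniformly over m and delta.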



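Lemma 19 (Litvak et al., 2005; Hsu et al., 2012b). Let $y_1, y_2, \ldots, y_m$ be i.i.d. $N(0, I)$ random vectors in $\mathbb{R}^p$. Then for any $\epsilon_0 \in (0, 1/2)$ and $\delta \in (0, 1)$,

$$\Pr\left[ \left\| \frac{1}{m} \sum_{i=1}^m y_i y_i^\top - I \right\|_2 > \frac{1}{1 - 2\epsilon_0} \left( \sqrt{\frac{32 \ln((1 + 2/\epsilon_0)^p/\delta)}{m}} + \frac{2 \ln((1 + 2/\epsilon_0)^p/\delta)}{m} \right) \right] \le \delta.$$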



Lemma 20 (Sums of cubes of normal random variables). Let $z_1, z_2, \ldots, z_m$ be i.i.d. $N(0, 1)$ random variables. Then for any $\delta \in (0, 1)$,

$$\Pr\left[ \left| \sum_{i=1}^m z_i^3 \right| > \sqrt{27 e^3 m \lceil \ln(1/\delta) \rceil^3} \right] \le \delta.$$

Proof. We use Markov's inequality via the $p$-th moment to derive the tail inequality. Pick some even integer $p \ge 2$, and observe that

$$\mathbb{E}\left[ \left( \sum_{i=1}^m z_i^3 \right)^p \right] = \sum_{i_1, i_2, \ldots, i_p \in [m]} \mathbb{E}\left[ \prod_{j=1}^p z_{i_j}^3 \right].$$

By the independence of the $z_i$'s, a term in the summation is zero if any index $i \in [m]$ is selected an odd number of times (i.e., $|\{j \in [p] : i_j = i\}|$ is odd, for any $i \in [m]$). Therefore the summation can be written as

$$\sum_{p_1 + \cdots + p_m = p/2} \mathbb{E}\left[ \prod_{i=1}^m z_i^{6 p_i} \right] = \sum_{p_1 + \cdots + p_m = p/2} \prod_{i=1}^m \mathbb{E}\left[ z_i^{6 p_i} \right] = \sum_{p_1 + \cdots + p_m = p/2} \prod_{i=1}^m (6 p_i - 1)!!$$

where the summations are over non-negative integers $p_1, p_2, \ldots, p_m$ that sum to $p/2$, and $n!!$ is the product of all odd integers between 1 and $n$; the last step uses the well-known fact that $\mathbb{E}[z^k] = (k - 1)!!$ for even $k$ and a standard normal random variable $z$. As $p_i \le p/2$ for each $i \in [m]$, the product can be crudely bounded by $(3p)^{3p/2}$, and hence the sum is bounded by

$$(3p)^{3p/2} \binom{p/2 + m - 1}{p/2} \le (27 p^3)^{p/2} \left( \frac{(p/2 + m - 1)e}{p/2} \right)^{p/2} \le (27 e m p^3)^{p/2}.$$

By Markov's inequality, for any $t > 0$,

$$\Pr\left[ \left| \sum_{i=1}^m z_i^3 \right| > t \right] \le t^{-p}\,\mathbb{E}\left[ \left( \sum_{i=1}^m z_i^3 \right)^p \right] \le \left( \frac{27 e m p^3}{t^2} \right)^{p/2}.$$

The bound is at most $\delta$ for $t := e\sqrt{27 e m \lceil \ln(1/\delta) \rceil^3}$ and $p := \lceil \ln(1/\delta) \rceil$. □
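The proof of Lemma 20 uses the fact that the even moments of a standard normal variable are odd double factorials, E[z^k] = (k - 1)!!. A minimal sketch of this fact follows, using the standard recurrence E[z^k] = (k - 1) E[z^(k-2)] (from integration by parts); the function names are hypothetical.

```python
def odd_double_factorial(n):
    """Product of all odd integers between 1 and n (equals 1 when n < 1)."""
    result = 1
    for j in range(1, n + 1, 2):
        result *= j
    return result

def gaussian_moment(k):
    """E[z^k] for z ~ N(0, 1), via E[z^k] = (k - 1) * E[z^(k-2)]."""
    if k == 0:
        return 1
    if k == 1:
        return 0          # odd moments vanish by symmetry
    return (k - 1) * gaussian_moment(k - 2)

# The moments needed in the proof have order 6 * p_i.
for k in (2, 4, 6, 12):
    print(k, gaussian_moment(k), odd_double_factorial(k - 1))
```

For k = 6, the case of a single cube pair, both functions return 15.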



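Lemma 21 (Third-order tensor of normal random vectors). Let $y_1, y_2, \ldots, y_m$ be i.i.d. $N(0, I)$ random vectors in $\mathbb{R}^p$. Then for any $\epsilon_0 \in (0, 1/3)$ and $\delta \in (0, 1)$,

$$\Pr\left[ \left\| \frac{1}{m} \sum_{i=1}^m y_i \otimes y_i \otimes y_i \right\|_2 > \frac{1}{1 - 3\epsilon_0} \sqrt{\frac{27 e^3 \lceil \ln((1 + 2/\epsilon_0)^p/\delta) \rceil^3}{m}} \right] \le \delta.$$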



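Proof. We follow the covering approach of Litvak et al. (2005). Let $Q \subset \{x \in \mathbb{R}^p : \|x\|_2 = 1\}$ be an $\epsilon_0$-cover of the unit sphere in $\mathbb{R}^p$ of cardinality at most $(1 + 2/\epsilon_0)^p$, which can be shown to exist by a standard volume argument (Pisier, 1989). Let $Y := m^{-1} \sum_{i=1}^m y_i \otimes y_i \otimes y_i$ and $\epsilon := (27 e^3 \lceil \ln(|Q|/\delta) \rceil^3/m)^{1/2}$. Since $y_i^\top q$ is distributed as $N(0, 1)$ for any $q \in Q$, it follows from Lemma 20 and a union bound that $\Pr[\exists q \in Q \,.\, |Y[q, q, q]| > \epsilon] \le \delta$. Henceforth we assume $\forall q \in Q \,.\, |Y[q, q, q]| \le \epsilon$. Now pick any unit vector $u$ such that $|Y[u, u, u]|$ is maximized (i.e., $\|Y\|_2 = |Y[u, u, u]|$), choose $q \in Q$ such that $\|q - u\|_2 \le \epsilon_0$, and set $\Delta := u - q$ and $\bar\Delta := \Delta/\|\Delta\|_2$. Then

$$\begin{aligned}
\|Y\|_2 &= |Y[u, u, u]| \\
&= |Y[\Delta, u, u] + Y[q, \Delta, u] + Y[q, q, \Delta] + Y[q, q, q]| \\
&\le \epsilon_0 \left( |Y[\bar\Delta, u, u]| + |Y[q, \bar\Delta, u]| + |Y[q, q, \bar\Delta]| \right) + \epsilon
\end{aligned}$$

by the triangle inequality and the facts $\|\Delta\|_2 \le \epsilon_0$ and $|Y[q, q, q]| \le \epsilon$. Since $Y$ has the form $Y = \sum_{j=1}^m \tilde y_j \otimes \tilde y_j \otimes \tilde y_j$ for vectors $\tilde y_j := m^{-1/3} y_j \in \mathbb{R}^p$, it follows that

$$\begin{aligned}
\sup_{\|u\|_2 = \|v\|_2 = \|w\|_2 = 1} |Y[u, v, w]| &= \sup_{\|u\|_2 = \|v\|_2 = \|w\|_2 = 1} \left| \sum_{j=1}^m (u^\top \tilde y_j)(v^\top \tilde y_j)(w^\top \tilde y_j) \right| \\
&= \sup_{\|u\|_2 = \|v\|_2 = \|w\|_2 = 1} \left| u^\top \tilde Y \operatorname{diag}(w^\top \tilde Y) \tilde Y^\top v \right| \\
&= \sup_{\|w\|_2 = 1} \|\tilde Y \operatorname{diag}(w^\top \tilde Y) \tilde Y^\top\|_2 \\
&= \sup_{\|u\|_2 = \|w\|_2 = 1} |Y[u, u, w]|
\end{aligned}$$

where $\tilde Y = [\tilde y_1 | \tilde y_2 | \cdots | \tilde y_m] \in \mathbb{R}^{p \times m}$; i.e., we can take the unit vectors $u$ and $v$ achieving $|Y[u, v, w]| = \|Y\|_2$ to be the same. By symmetry, $\sup_{\|u\|_2 = 1} |Y[u, u, u]| = \|Y\|_2$. Therefore

$$\|Y\|_2 \le \epsilon_0 \left( |Y[\bar\Delta, u, u]| + |Y[q, \bar\Delta, u]| + |Y[q, q, \bar\Delta]| \right) + \epsilon \le 3\epsilon_0 \|Y\|_2 + \epsilon$$

which implies $\|Y\|_2 \le \epsilon/(1 - 3\epsilon_0)$. This proves that

$$\Pr[\|Y\|_2 \le \epsilon/(1 - 3\epsilon_0)] \ge \Pr[\forall q \in Q \,.\, |Y[q, q, q]| \le \epsilon] \ge 1 - \delta$$

as required. □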
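The covering argument in the proof of Lemma 21 rests on the multilinear identity Y[u,u,u] = Y[D,u,u] + Y[q,D,u] + Y[q,q,D] + Y[q,q,q] with D = u - q. The sketch below verifies this identity numerically for an empirical third-moment tensor built from random Gaussian vectors; the dimensions, seed and chosen vectors are illustrative assumptions.

```python
import math
import random

def cubic_form(ys, u, v, w):
    """Y[u, v, w] for Y = (1/m) * sum_i y_i (x) y_i (x) y_i."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(dot(u, y) * dot(v, y) * dot(w, y) for y in ys) / len(ys)

def unit(v):
    """Rescale v to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

rng = random.Random(0)
ys = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(50)]

u = unit([1.0, 2.0, -1.0])        # stands in for the maximizing direction
q = unit([1.1, 1.9, -1.0])        # stands in for a nearby cover point
delta = [a - b for a, b in zip(u, q)]

lhs = cubic_form(ys, u, u, u)
rhs = (cubic_form(ys, delta, u, u) + cubic_form(ys, q, delta, u)
       + cubic_form(ys, q, q, delta) + cubic_form(ys, q, q, q))
print(abs(lhs - rhs) < 1e-9)
```

Each of the first three terms on the right carries one factor of D, which is what allows them to be bounded by the cover parameter times the norm of Y.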